A diver breathing monitoring method and system for a closed circuit rebreather

By using a microphone array and generative adversarial network in a closed-circuit respirator, the problems of high engineering cost, heavy wearing burden and insufficient accuracy in respiratory monitoring in the prior art are solved, and a respiratory monitoring effect with high signal-to-noise ratio and high dynamic range is achieved.

CN122201343A · Pending · Publication Date: 2026-06-12 · BEIJING UNIV OF POSTS & TELECOMM

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BEIJING UNIV OF POSTS & TELECOMM
Filing Date
2026-03-09
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing technologies for respiratory monitoring in closed-circuit respirators suffer from problems such as high engineering costs, heavy wearing burden, and insufficient accuracy. In particular, signal transmission is limited in underwater environments, making it difficult to achieve efficient and accurate respiratory monitoring.

Method used

A microphone array deployed in a CO2 scrubber is used to achieve accurate extraction and monitoring of respiratory audio signals through multi-channel information fusion and generative adversarial networks. The process includes steps such as microphone array design, signal quality scoring, feature mapping, filtering and connected component merging, and generative adversarial network reconstruction.

Benefits of technology

It effectively avoids the noise interference introduced by insufficient deployment location accuracy and limb movements, and achieves respiratory monitoring with high signal-to-noise ratio and high dynamic range, thereby improving the accuracy and reliability of respiratory monitoring.



Abstract

The application provides a diver breathing monitoring method and system for a closed-circuit breathing apparatus. The method includes the following steps: audio data is collected by the microphones of a microphone array, where the audio data of the audio channel corresponding to each microphone includes a plurality of data frame signals, and a perceived quality score is calculated for the data frame signals of each audio channel; a spatial attention weight is calculated from the perceived quality scores of the data frame signals of the same frame across the audio channels, and those data frame signals are fused according to the spatial attention weight to obtain a fused frame signal; a short-time Fourier transform is performed on the fused frame signal to obtain a first STFT amplitude spectrum, which is mapped to the Mel frequency domain to obtain an initial Mel spectrum; a time-domain masking signal is constructed from the initial Mel spectrum; and the time-domain masking signal is input into a preset generative adversarial network, from which the breathing monitoring result is determined.

Description

Technical Field

[0001] This invention relates to the field of respiratory monitoring technology, and in particular to a method and system for monitoring the respiratory function of divers using closed-circuit respirators.

Background Technology

[0002] With the increasing frequency of underwater missions in specialized fields such as technical diving, deep-sea scientific research, and military special operations, the number of professionals in these fields has grown significantly. In such missions, closed-circuit respirators (CCRs) have become the preferred choice for professionals due to their excellent endurance and stealth capabilities.

[0003] Current research focuses on using wearable sensors to capture various physiological signals from the human respiratory system. PPG-based wearable devices attempt to infer respiratory rate from surface physiological signals, while flexible strain sensor-based solutions capture chest and abdominal movement characteristics during respiration. However, these solutions typically require highly customized skin-mounted electrodes or specialized diving suits with integrated sensors, resulting in high engineering costs and limited widespread adoption. Non-contact radio frequency technologies, such as millimeter-wave radar, which have performed well in terrestrial settings in recent years, cannot achieve effective signal transmission underwater due to the extremely high absorption and attenuation of high-frequency electromagnetic waves in water. Therefore, some work has shifted to acoustic monitoring solutions, attempting to extract tracheal sounds using microphones attached to the throat. However, such technologies introduce additional wearing burden and discomfort, are highly sensitive to deployment location, and lack sufficient accuracy in respiratory monitoring.

Summary of the Invention

[0004] In view of this, embodiments of the present invention provide a method for monitoring the breathing of divers using closed-circuit respirators, in order to eliminate or improve one or more defects existing in the prior art.

[0005] One aspect of the present invention provides a method for monitoring the breathing of divers using a closed-circuit respirator, the method being based on a microphone array deployed in a CO2 scrubber, the method comprising the steps of: Audio data is collected from microphones in a microphone array. The audio data of each microphone's corresponding audio channel includes multiple data frame signals. For each audio channel's data frame signal, a perceptual quality score is calculated. Spatial attention weights are calculated based on the perceived quality scores of the data frame signals of the same frame from each audio channel. The data frame signals of the same frame from each audio channel are then fused based on the spatial attention weights to obtain a fused frame signal. The fused frame signal is subjected to a short-time Fourier transform to obtain the first STFT amplitude spectrum. The first STFT amplitude spectrum is then mapped to the Mel frequency domain to obtain the initial Mel spectrum. The filtered Mel spectrum signal is calculated based on the initial Mel spectrum. The noise residual is calculated based on the filtered Mel spectrum signal and the initial Mel spectrum. A binary mask is constructed based on the noise residual. The binary mask is applied to the initial Mel spectrum to obtain a zeroed Mel spectrum. The zeroed Mel spectrum is mapped back to the second STFT amplitude spectrum by inverse Mel transform. The second STFT amplitude spectrum is then subjected to inverse short-time Fourier transform to construct a time-domain masking signal. The time-domain masking signal is input into a preset generative adversarial network (GAN), and the respiratory monitoring result is determined based on the GAN.

[0006] The above scheme first provides multiple microphones, each of which collects audio on its own audio channel. Alignment and fusion make the multi-channel information complementary and avoid the inaccuracy caused by deployment location. In addition, the diver's body movements inevitably introduce instantaneous spike noise, which originates from the water flow impact those movements cause. To delimit the time-frequency region covered by each instantaneous impact, the algorithm first performs feature extraction and region labeling based on morphological operations. This process includes three key steps (feature mapping, structured filtering, and connected component merging) and yields a time-domain masking signal. Finally, a generative adversarial network fills in the masked region, which supports the accuracy of the final respiratory monitoring results.

[0007] In some embodiments of the present invention, the step of calculating the perceived quality score for the data frame signals of each audio channel comprises: calculating the normalized power spectral density at each sampling point of each data frame signal; calculating the normalized spectral entropy of the data frame signal based on the normalized power spectral density at each of its sampling points, and calculating the amplitude saturation penalty term of the data frame signal; and calculating the perceived quality score of the data frame signal of the current audio channel based on the normalized spectral entropy and the amplitude saturation penalty term.

[0008] In some embodiments of the present invention, in the step of calculating the normalized power spectral density at each sampling point of each data frame signal, the normalized power spectral density is calculated by a formula whose symbols denote, respectively: the normalized power spectral density at a given sampling point of a given data frame signal of a given audio channel; the number of sampling points of that data frame signal; and the amplitude value at that sampling point of that data frame signal.

[0009] In some embodiments of the present invention, in the step of calculating the normalized spectral entropy of the data frame signal based on the normalized power spectral density at each sampling point of the data frame signal, the normalized spectral entropy is calculated by a formula whose symbols denote, respectively: the normalized spectral entropy of a given data frame signal of a given audio channel; the normalized power spectral density at each sampling point of that data frame signal; and the number of sampling points of that data frame signal.

[0010] In some embodiments of the present invention, in the step of calculating the amplitude saturation penalty term of the data frame signal, the amplitude saturation penalty term is calculated by a formula whose symbols denote, respectively: the amplitude saturation penalty term of a given data frame signal of a given audio channel; the amplitude value at each sampling point of that data frame signal; the preset acoustic overload threshold; and the number of sampling points of that data frame signal.

[0011] In some embodiments of the present invention, in the step of calculating the perceived quality score of the data frame signal of the current audio channel based on the normalized spectral entropy and the amplitude saturation penalty term, the perceived quality score is calculated by a formula whose symbols denote, respectively: the perceived quality score of a given data frame signal of a given audio channel; the amplitude saturation penalty term of that data frame signal; the normalized spectral entropy of that data frame signal; and two preset balance coefficients.

[0012] In some embodiments of the present invention, in the step of calculating the spatial attention weight based on the perceived quality scores of the data frame signals of the same frame across the audio channels, the spatial attention weight is calculated by a formula whose symbols denote, respectively: the spatial attention weight of a given data frame signal of a given audio channel; the perceived quality score of that data frame signal in the audio channel concerned; the perceived quality score of the same data frame signal in an audio channel of the array; and the temperature coefficient.
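The presence of a temperature coefficient suggests a softmax-style normalization of the per-channel scores. The sketch below assumes that form; the exact formula is not recoverable from the text, and the function name `spatial_attention_weights` and the sample scores are hypothetical.

```python
import math


def spatial_attention_weights(scores, temperature=1.0):
    """Softmax over per-channel quality scores for one frame index (assumed form)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical quality scores for the four channels at one frame index.
scores = [0.9, 0.2, 0.5, 0.1]
weights = spatial_attention_weights(scores, temperature=0.5)
uniform = spatial_attention_weights([0.3, 0.3, 0.3, 0.3])
```

Under this assumption, a lower temperature concentrates the weight on the best-scoring channel, while a higher temperature moves the fusion toward a plain average.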

[0013] In some embodiments of the present invention, in the step of calculating the filtered Mel spectrum signal based on the initial Mel spectrum, a horizontal closing operation is performed on the initial Mel spectrum to obtain a closing operation signal, and a vertical opening operation is performed on the closing operation signal to obtain an opening operation signal. The filtered Mel spectrum signal is then calculated based on the opening operation signal and the initial Mel spectrum.

[0014] In some embodiments of the present invention, in the step of performing a horizontal closing operation on the initial Mel spectrum to obtain a closing operation signal, performing a vertical opening operation on the closing operation signal to obtain an opening operation signal, and calculating the filtered Mel spectrum signal based on the opening operation signal and the initial Mel spectrum, three formulas are used. The closing operation signal is calculated by a formula whose symbols denote, respectively: the closing operation signal of a given Mel frequency band of the fused frame signal corresponding to a given data frame signal; the transpose of the initial Mel spectrum of that Mel frequency band; a preset horizontal structuring element; the dilation operation; and the erosion operation. The opening operation signal is calculated by a formula whose symbols denote, respectively: the opening operation signal of that Mel frequency band; and a preset vertical structuring element. The filtered Mel spectrum signal is calculated by a formula whose symbols denote, respectively: the filtered Mel spectrum signal of that Mel frequency band; the transpose of the opening operation signal of that Mel frequency band; the maximum value in the initial Mel spectrum; and the maximum value in the opening operation signal.
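The closing and opening operations can be illustrated with flat structuring elements, where dilation is a sliding maximum and erosion a sliding minimum. This is a sketch under assumptions: the structuring element lengths, the mapping of rows and columns to time and Mel bands, and the final rescaling by the ratio of maxima are not fixed by the text, and all function names are hypothetical.

```python
def _slide(row, length, pick):
    """Apply pick (max or min) over a centered window of the given odd length."""
    half = length // 2
    return [pick(row[max(0, i - half): i + half + 1]) for i in range(len(row))]


def transpose(grid):
    return [list(col) for col in zip(*grid)]


def close_rows(grid, length):
    """Closing: dilation followed by erosion, along each row."""
    dilated = [_slide(row, length, max) for row in grid]
    return [_slide(row, length, min) for row in dilated]


def open_columns(grid, length):
    """Opening: erosion followed by dilation, along each column."""
    cols = transpose(grid)
    eroded = [_slide(col, length, min) for col in cols]
    return transpose([_slide(col, length, max) for col in eroded])


def rescale_to_peak(filtered, original):
    """Scale the filtered grid so that its maximum matches the original's (assumed form)."""
    peak_f = max(max(row) for row in filtered)
    peak_o = max(max(row) for row in original)
    if peak_f == 0:
        return filtered
    return [[v * peak_o / peak_f for v in row] for row in filtered]


# Toy spectrum: a band three rows thick with a one-column gap, plus an isolated speck.
band = [4, 4, 4, 0, 4, 4, 4]
spec = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 2, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    band[:], band[:], band[:],
    [0, 0, 0, 0, 0, 0, 0],
]
closed = close_rows(spec, 3)       # fills the gap in the band
opened = open_columns(closed, 3)   # removes the one-row speck
filtered = rescale_to_peak(opened, spec)
```

In this toy case the closing bridges the short gap along the rows and the opening discards the structure that is thinner than the column element, leaving only the band.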

[0015] In some embodiments of the present invention, in the step of calculating spatial attention weights based on the perceived quality scores of data frame signals of the same frame from each audio channel, and fusing the data frame signals of the same frame from each audio channel based on the spatial attention weights to obtain a fused frame signal, the optimal phase lag value for each audio channel is calculated based on the data frame signals of the reference anchor channel, and the fused frame signal is calculated based on the optimal phase lag value and the spatial attention weights.

[0016] In some embodiments of the present invention, in the step of calculating the optimal phase lag value for each audio channel based on the data frame signal of the reference anchor channel, the optimal phase lag value is calculated by a formula whose symbols denote, respectively: the optimal phase lag value of a given data frame signal of a given audio channel; the amplitude value at a sampling point of the corresponding data frame signal of the reference anchor channel; the amplitude value at a sampling point of that data frame signal of the audio channel; and the preset phase lag value.

[0017] In some embodiments of the present invention, in the step of calculating the fused frame signal based on the optimal phase lag value and the spatial attention weight, the amplitude value of each sampling point in the fused frame signal is calculated based on the optimal phase lag value and the spatial attention weight, and the amplitude values of the sampling points are combined to obtain the fused frame signal. The amplitude value of each sampling point in the fused frame signal is calculated by a formula whose symbols denote, respectively: the amplitude value at a given sampling point of a given frame of the fused frame signal; the number of audio channels; the spatial attention weight of the corresponding data frame signal of each audio channel; and the amplitude value at a sampling point of that data frame signal of each audio channel.
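The alignment and fusion steps can be sketched as follows, assuming that the optimal lag is the candidate lag that maximizes the cross-correlation with the reference anchor channel and that the fused sample is the weighted sum of the lag-aligned channel samples. Both assumptions, and the names `optimal_lag` and `fuse_frames`, are illustrative rather than taken from the formulas.

```python
def optimal_lag(reference, channel, candidate_lags):
    """Return the candidate lag that maximizes cross-correlation with the reference."""
    n = len(reference)

    def correlation(lag):
        return sum(
            reference[t] * channel[t + lag]
            for t in range(n)
            if 0 <= t + lag < len(channel)
        )

    return max(candidate_lags, key=correlation)


def fuse_frames(channels, weights, lags):
    """Weighted sum of lag-aligned samples; out-of-range samples count as zero."""
    n = len(channels[0])
    fused = []
    for t in range(n):
        value = 0.0
        for frame, weight, lag in zip(channels, weights, lags):
            idx = t + lag
            if 0 <= idx < len(frame):
                value += weight * frame[idx]
        fused.append(value)
    return fused


# Toy frames: the second channel carries the same pulse two samples later.
reference = [0.0, 0.0, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0]
delayed = [0.0, 0.0, 0.0, 0.0, 1.0, 0.5, 0.0, 0.0]
lags = [0, optimal_lag(reference, delayed, range(-3, 4))]
fused = fuse_frames([reference, delayed], [0.5, 0.5], lags)
```

With equal weights and the recovered lag, the two pulses add coherently and the fused frame reproduces the reference pulse.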

[0018] In some embodiments of the present invention, the method further includes model pre-training, wherein adversarial loss, numerical precision loss and frequency domain consistency loss are calculated respectively, and the sum of adversarial loss, numerical precision loss and frequency domain consistency loss is calculated as the total loss, and the generative adversarial network is trained based on the total loss.

[0019] A second aspect of the present invention also provides a diver breathing monitoring system for a closed-circuit respirator, the system comprising a computer device including a processor and a memory, the memory storing computer instructions, the processor executing the computer instructions stored in the memory, and the system performing the steps of the method described above when the computer instructions are executed by the processor.

[0020] A third aspect of the invention also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the aforementioned method for monitoring the breathing of divers using a closed-circuit respirator.

[0021] Additional advantages, objects, and features of the invention will be set forth in part in the description which follows, and will also become apparent in part to those skilled in the art upon studying the text, or may be learned by practice of the invention. The objects and other advantages of the invention will become apparent from the description and the accompanying drawings.

[0022] Those skilled in the art will understand that the objectives and advantages achievable with the present invention are not limited to those specifically described above, and that the above and other objectives achievable with the present invention will become clearer from the following detailed description.

Attached Figure Description

[0023] The accompanying drawings, which are provided to further illustrate the invention and form part of this application, are not intended to limit the scope of the invention.

[0024] Figure 1 is a schematic diagram of one implementation of the diver breathing monitoring method for closed-circuit respirators in this solution; Figure 2 is a schematic diagram of the overall processing architecture of this solution; Figure 3(a) is a schematic diagram of the CCR device; Figure 3(b) is a schematic diagram of the CCR device structure; Figure 4(a) is a photograph of the microphone array; Figure 4(b) is a schematic diagram of the microphone array geometry; Figure 5 is a diagram showing microphone deployment locations; Figure 6 is a schematic diagram of the signal waveforms received by microphones at different locations; Figure 7 is a schematic diagram of water flow noise interference; Figure 8 is a schematic diagram of the generative adversarial network architecture; Figure 9 is a schematic diagram of respiratory rate monitoring in the experimental case; Figure 10 is a schematic diagram of respiratory ratio monitoring in the experimental case; Figure 11 is a schematic diagram of anomaly detection in the experimental case.

Detailed Implementation

[0025] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the embodiments and accompanying drawings. Here, the illustrative embodiments and descriptions of this invention are used to explain the invention, but are not intended to limit the invention.

[0026] It should also be noted that, in order to avoid obscuring the invention with unnecessary details, only the structures and / or processing steps closely related to the solution according to the invention are shown in the accompanying drawings, while other details that are not closely related to the invention are omitted.

[0027] Introduction to existing technologies: Existing technology 1: Existing technologies rely on highly customized sensors or devices for monitoring. Typical solutions include: (1) PPG, which calculates respiratory rate by capturing changes in light signals during blood flow; and (2) SCFBSS, which uses the deformation sensing characteristics of flexible sensors to monitor respiratory movements by monitoring chest cavity movements during respiration.

[0028] Disadvantages of existing technology 1: Such technologies typically require highly customized, skin-mounted electrodes or specialized diving suits with integrated sensors, resulting in high engineering costs and limited widespread adoption. For example, PPG requires embedded, sealed wrist or forehead devices, while SCFBSS relies on customized diving suits with integrated sensors. These solutions significantly increase the engineering complexity and cost of the equipment, and their high degree of customization makes large-scale adoption difficult. Furthermore, PPG is highly susceptible to reading distortion due to insufficient perfusion to the extremities caused by the high thermal conductivity of water, while SCFBSS relies heavily on the accurate capture of chest movements, making it prone to motion artifacts under intense limb movements or water currents.

[0029] Existing technology 2: Existing technology two is based on acoustic sensing technology. A typical solution uses a piezoelectric or electret contact microphone as a sensor, which is closely attached to the diver's trachea. This solution picks up the turbulent vibration signal (i.e., tracheal sound) generated by the breathing airflow in the trachea, and after filtering and envelope extraction, it calculates the breathing frequency and intensity. Disadvantages of existing technology 2: Such technologies introduce additional wearing burden and discomfort, as the pressure required for the sensor to maintain contact can easily cause discomfort to divers. Secondly, tracheal sound monitoring is highly sensitive to deployment location; complex underwater limb movements and head rotations can cause minute displacements of the sensor, resulting in a sharp drop in the signal-to-noise ratio.

[0030] As shown in Figure 1, this invention proposes a method for monitoring the breathing of divers using closed-circuit respirators. The method is based on a microphone array deployed in a CO2 scrubber and comprises steps S100 to S500 described below. Specifically, as shown in Figure 3(a), a closed circuit rebreather (CCR) has a fixed airflow circulation loop, and its acoustic characteristics are directly related to the gas flow rate in the loop. Figure 3(b) illustrates the gas flow path inside the CCR. The red arrows represent the flow direction of exhaled gas, and the blue arrows represent the flow direction of inhaled gas. The CO2 scrubber removes CO2 through a chemical reaction, converting the exhaled gas into inhalable gas and thus completing the airflow cycle. In this process, the mouthpiece, bellows, inhalation and exhalation regurgitation system, and CO2 scrubber together form a complete, closed airflow circuit. Under the mechanical constraint of the mouthpiece's one-way valve, the diver's exhaled and inhaled gas maintain a strictly unidirectional flow within the circuit.

[0031] To address the high bandwidth and severe energy fluctuations of turbulent noise during CCR gas flow, this invention designs a four-channel distributed sampling array. As shown in Figure 4(b), unlike traditional compact beamforming arrays, this scheme adopts a rectangular distributed layout. As shown in Figure 4(a), four high-sensitivity MEMS microphones are deployed at the four corners of the PCB to expand the sound pickup range.

[0032] As shown in Figure 5, the microphone array is deployed on the tank wall of the CO2 scrubber. Through a distributed layout, this scheme uses the attenuation gradient of sound waves within a confined space to create differences in received energy between channels, which makes multi-channel information available:
1. Avoid global overload: when airflow directly impacts a microphone on one side and causes acoustic overload, the natural attenuation over distance keeps the far microphone within its normal operating range.
2. Ensure weak signal pickup: during the weak breathing phase, multi-point coverage increases the probability that some pickup location has a high signal-to-noise ratio.

[0033] Step S100: Acquire audio data based on the microphones in the microphone array. The audio data of the audio channel corresponding to each microphone includes multiple data frame signals. Calculate the perceived quality score for the data frame signals of each audio channel. Step S200: Calculate spatial attention weights based on the perceived quality scores of the data frame signals of the same frame in each audio channel; fuse the data frame signals of the same frame in each audio channel based on the spatial attention weights to obtain a fused frame signal. Figure 6 shows the signal waveforms picked up by microphones at different locations within the microphone array; the area within the red box represents clipping distortion caused by sound pressure level overload. As Figure 6 shows, the microphones closer to the airflow path receive high-energy signals, but their sound pressure levels are very likely to exceed the microphone's acoustic overload point, resulting in clipping distortion and loss of key frequency-domain characteristics of breathing. Meanwhile, the microphones farther from the path receive very weak signals, which are easily drowned out by noise.

[0034] To obtain breathing audio that simultaneously satisfies high dynamic range and high signal-to-noise ratio, this scheme abandons the single energy index and proposes a composite sensing quality scoring mechanism that combines frequency domain sparsity and temporal linearity to calculate the sensing quality score of data frame signals.

[0035] Step S300: Perform a short-time Fourier transform on the fused frame signal to obtain the first STFT amplitude spectrum, and map the first STFT amplitude spectrum to the Mel frequency domain to obtain the initial Mel spectrum. Step S400: Calculate the filtered Mel spectrum signal based on the initial Mel spectrum; calculate the noise residual based on the filtered Mel spectrum signal and the initial Mel spectrum; construct a binary mask based on the noise residual; apply the binary mask to the initial Mel spectrum to obtain a zeroed Mel spectrum; perform an inverse Mel transform on the zeroed Mel spectrum to map it back to the second STFT amplitude spectrum; and perform an inverse short-time Fourier transform on the second STFT amplitude spectrum to construct a time-domain masking signal. In the specific implementation process, after obtaining the filtered Mel spectrum signal after morphological filtering, in order to accurately define the noise region and output the masking interval for subsequent generative adversarial network repair, this scheme further extracts the noise residual and generates a connected component mask.

[0036] First, the difference between the original Mel spectrum and the filtered spectrum is defined as the noise residual; its non-negative energy components are retained and normalized, giving the noise residual of each Mel frequency band of the fused frame signal corresponding to each data frame signal. An energy threshold is then applied to the normalized residual, and binarization yields the initial vertical-line noise mask. Connected-component labeling is performed on this mask, and the algorithm keeps, as effective vertical-line noise, the regions whose time span and frequency-domain span meet the preset criteria. Neighbouring noise intervals are merged according to the time interval between them, giving the time intervals over which noise accumulates, each defined by the start and end times of a long-duration noise interval. A binary mask is then obtained from these intervals.
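The residual, thresholding, and interval-merging steps can be sketched as follows. This is a simplification under stated assumptions: instead of full two-dimensional connected-component labeling, a frame is flagged when enough Mel bands exceed the threshold, and the values of `energy_threshold`, `min_bands`, and `max_gap` are hypothetical because the original thresholds are not recoverable from the text.

```python
def noise_residual(mel, filtered):
    """Keep the non-negative part of (mel - filtered) and scale it to [0, 1]."""
    diff = [[max(m - f, 0.0) for m, f in zip(mr, fr)] for mr, fr in zip(mel, filtered)]
    peak = max(max(row) for row in diff)
    if peak == 0.0:
        return diff
    return [[v / peak for v in row] for row in diff]


def noisy_frames(residual, energy_threshold, min_bands):
    """Flag a frame when at least min_bands Mel bands exceed the energy threshold."""
    n_frames = len(residual[0])
    return [
        sum(1 for band in residual if band[t] > energy_threshold) >= min_bands
        for t in range(n_frames)
    ]


def merge_intervals(flags, max_gap):
    """Turn flagged frames into [start, end] intervals, merging gaps up to max_gap frames."""
    intervals = []
    for t, flagged in enumerate(flags):
        if not flagged:
            continue
        if intervals and t - intervals[-1][1] - 1 <= max_gap:
            intervals[-1][1] = t
        else:
            intervals.append([t, t])
    return intervals


def apply_mask(mel, intervals):
    """Zero every Mel band inside the masked frame intervals."""
    masked = [row[:] for row in mel]
    for start, end in intervals:
        for row in masked:
            for t in range(start, end + 1):
                row[t] = 0.0
    return masked


# Toy input: 3 Mel bands x 8 frames, with broadband spikes at frames 2 and 4.
mel = [[5.0 if t in (2, 4) else 1.0 for t in range(8)] for _ in range(3)]
filtered = [[1.0] * 8 for _ in range(3)]
flags = noisy_frames(noise_residual(mel, filtered), energy_threshold=0.5, min_bands=3)
intervals = merge_intervals(flags, max_gap=1)
masked = apply_mask(mel, intervals)
```

In this toy case the two spikes are one frame apart, so they merge into a single masked interval that also covers the frame between them.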

[0037] The binary mask is applied to the Mel spectrum, and the masked entries are set to zero, yielding the zeroed Mel spectrum. The zeroed Mel spectrum is then mapped back to the linear STFT amplitude spectrum through the inverse Mel transform. Combined with the phase of the fused frame signal, the time-domain masking signal is obtained through the inverse short-time Fourier transform (ISTFT); the formula for this step uses the imaginary unit and a phase factor, where the phase is the original phase of the fused frame signal. The marked intervals are used as dynamic masking inputs for the generative adversarial network, guiding the model to perform context reconstruction.
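The chain from mask to time-domain signal can be written compactly under assumed notation, since the original symbols are not recoverable from the text. Here $M$ is the initial Mel spectrum, $B$ the binary mask (taken to be 1 inside noise intervals, which is an assumption about its polarity), $\mathcal{M}^{-1}$ the inverse Mel transform, $\hat{S}_2$ the second STFT amplitude spectrum, and $\phi$ the original phase of the fused frame signal.

```latex
\begin{aligned}
\tilde{M}(b, t) &= M(b, t)\,\bigl(1 - B(t)\bigr) \\
\hat{S}_2 &= \mathcal{M}^{-1}\bigl(\tilde{M}\bigr) \\
\tilde{x} &= \mathrm{ISTFT}\bigl(\hat{S}_2 \odot e^{\,j\phi}\bigr)
\end{aligned}
```

Reusing the original phase $\phi$ is what allows an amplitude-only spectrum to be inverted to a waveform without a separate phase estimation step.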

[0038] Step S500: Input the time-domain masking signal into a preset generative adversarial network, and determine the respiratory monitoring result based on the generative adversarial network.

[0039] The system architecture of this solution is shown in Figure 2. The system consists of three parts. The audio fusion module receives the raw multi-channel signals collected by the microphone array, performs spatial adaptive weighting and phase alignment of the multi-channel signals, and generates a single-channel audio signal with both high dynamic range and high signal-to-noise ratio. After fusion, a waveform reconstruction algorithm extracts spectral features, marks the water flow noise region based on the time-frequency morphological differences between breathing sounds and noise, and reconstructs the breathing waveform using a GAN model based on contextual information. Finally, the breathing state calculation module extracts the breathing sound envelope, calculates the breathing rate and breathing ratio, and raises an alarm on breath-holding events.

[0040] The above scheme first provides multiple microphones, each of which collects audio on its own audio channel. Alignment and fusion make the multi-channel information complementary and avoid the inaccuracy caused by deployment location. In addition, the diver's body movements inevitably introduce instantaneous spike noise, which originates from the water flow impact those movements cause. To delimit the time-frequency region covered by each instantaneous impact, the algorithm first performs feature extraction and region labeling based on morphological operations. This process includes three key steps (feature mapping, structured filtering, and connected component merging) and yields a time-domain masking signal. Finally, a generative adversarial network fills in the masked region, which supports the accuracy of the final respiratory monitoring results.

[0041] In some embodiments of the present invention, the step of calculating the perceived quality score for the data frame signals of each audio channel comprises: calculating the normalized power spectral density at each sampling point of each data frame signal; calculating the normalized spectral entropy of the data frame signal based on the normalized power spectral density at each of its sampling points, and calculating the amplitude saturation penalty term of the data frame signal; and calculating the perceived quality score of the data frame signal of the current audio channel based on the normalized spectral entropy and the amplitude saturation penalty term.

[0042] In some embodiments of the present invention, in the step of calculating the normalized power spectral density at each sampling point of each data frame signal, the normalized power spectral density is calculated by a formula whose symbols denote, respectively: the normalized power spectral density at a given sampling point of a given data frame signal of a given audio channel; the number of sampling points of that data frame signal; and the amplitude value at that sampling point of that data frame signal.

[0043] In the specific implementation process, $k$ refers to the sampling point index of the power spectrum in frequency domain analysis, and it runs from 1 to $K$, where $K$ is the total number of sampling points after frequency domain transformation of a single frame signal. $K$ is determined by the number of sampling points in a single frame: $K = f_s \cdot T_f$, where $f_s$ is the sampling rate used in this scheme and $T_f$ is the frame length of a single frame used in this scheme.

[0044] Using the above scheme, for the $t$-th frame input signal of the $m$-th audio channel, the normalized power spectral density is calculated first; it is obtained from the Fast Fourier Transform (FFT) result of the signal frame.

[0045] In some embodiments of the present invention, in the step of calculating the normalized spectral entropy of the data frame signal based on the normalized power spectral density at each sampling point of the data frame signal, the normalized spectral entropy is calculated using the following formula:

$$H_{m,t} = -\frac{1}{\ln K} \sum_{k=1}^{K} P_{m,t}(k)\, \ln P_{m,t}(k)$$

where $H_{m,t}$ denotes the normalized spectral entropy of the $t$-th data frame signal of audio channel $m$, $P_{m,t}(k)$ denotes the normalized power spectral density at sampling point $k$ of the $t$-th data frame signal of audio channel $m$, and $K$ denotes the number of sampling points of the $t$-th data frame signal of audio channel $m$.

[0046] Based on the physical fact that breathing sounds have a pronounced harmonic structure while turbulent noise exhibits a broadband, disordered distribution, this scheme introduces the normalized spectral entropy to quantify the frequency domain sparsity of the signal. When a channel captures a clear breath sound, the energy is concentrated in the fundamental frequency and its harmonic components, so the spectral entropy is low; conversely, when the channel is dominated by broadband turbulent noise, the spectral entropy tends toward its maximum value.
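As a minimal sketch of the two quantities described above (the normalized power spectral density and the normalized spectral entropy), the following Python code uses NumPy's one-sided FFT. The function names, the use of `rfft`, the natural logarithm, and the flat fallback for a silent frame are assumptions made for illustration and are not taken from the original text.

```python
import numpy as np

def normalized_psd(frame):
    """Power spectrum of one frame, normalized so that it sums to 1."""
    power = np.abs(np.fft.rfft(np.asarray(frame, dtype=float))) ** 2
    total = power.sum()
    if total == 0.0:
        # Assumption: a silent frame is treated as a flat distribution.
        return np.full(power.shape, 1.0 / power.size)
    return power / total

def normalized_spectral_entropy(frame):
    """Shannon entropy of the normalized PSD, scaled into [0, 1]."""
    p = normalized_psd(frame)
    nonzero = p[p > 0]                      # 0 * log(0) is treated as 0
    return float(-(nonzero * np.log(nonzero)).sum() / np.log(p.size))
```

Under these assumptions, a pure tone yields an entropy close to 0 and white noise yields an entropy close to 1, which matches the qualitative behavior described in the paragraph above.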

[0047] In some embodiments of the present invention, in the step of calculating the amplitude saturation penalty term of the data frame signal, the amplitude saturation penalty term is calculated using the following formula:

$$C_{m,t} = \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}\left( |x_{m,t}(n)| \ge A_{\text{th}} \right)$$

where $C_{m,t}$ denotes the amplitude saturation penalty term of the $t$-th data frame signal of audio channel $m$, $x_{m,t}(n)$ denotes the amplitude value at sampling point $n$ of the $t$-th data frame signal of audio channel $m$, $A_{\text{th}}$ denotes the preset acoustic overload threshold, and $N$ denotes the number of sampling points of the $t$-th data frame signal of audio channel $m$. $A_{\text{th}}$ can be 0.95.

[0048] To address the clipping distortion that easily occurs in near-end channels, this scheme defines an amplitude saturation penalty term. The term detects time-domain samples whose amplitude is close to the physical limit (that is, at or above the preset acoustic overload threshold) and quantifies the degree of nonlinear distortion as the proportion of such sampling points in the frame. The amplitude saturation penalty term ensures that high-energy but clipped signals, which usually show an extremely high pseudo signal-to-noise ratio, are severely penalized.
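The proportion-based penalty described above can be sketched in a few lines of Python. This sketch assumes that the waveform is normalized to the range [-1, 1] and uses the 0.95 threshold mentioned in the text as the default; the function name is hypothetical.

```python
import numpy as np

def saturation_penalty(frame, threshold=0.95):
    """Fraction of samples whose absolute amplitude reaches the overload threshold.

    Assumption: `frame` is normalized so that full scale corresponds to 1.0.
    """
    frame = np.asarray(frame, dtype=float)
    return float(np.mean(np.abs(frame) >= threshold))
```

A heavily clipped frame therefore receives a penalty close to 1, while a frame that never approaches full scale receives 0.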

[0049] In some embodiments of the present invention, in the step of calculating the perceived quality score of the data frame signal of the current audio channel based on the normalized spectral entropy and the amplitude saturation penalty term, the perceived quality score $Q_{m,t}$ of the $t$-th data frame signal of audio channel $m$ is calculated from the amplitude saturation penalty term $C_{m,t}$ and the normalized spectral entropy $H_{m,t}$ of that data frame signal, where $\alpha$ and $\beta$ are preset balance coefficients applied to the two terms.

[0050] With the above scheme, only a channel that both has a clear harmonic structure (low normalized spectral entropy) and lies in the non-clipping region (low amplitude saturation penalty term) receives a high score.

[0051] In some embodiments of the present invention, in the step of calculating the spatial attention weight based on the perceived quality scores of the data frame signals of the same frame of each audio channel, the spatial attention weight is calculated using the following formula:

$$w_{m,t} = \frac{\exp\left( Q_{m,t} / \tau \right)}{\sum_{j=1}^{M} \exp\left( Q_{j,t} / \tau \right)}$$

where $w_{m,t}$ denotes the spatial attention weight of the $t$-th data frame signal of audio channel $m$, $Q_{m,t}$ denotes the perceived quality score of the $t$-th data frame signal of audio channel $m$, $Q_{j,t}$ denotes the perceived quality score of the $t$-th data frame signal of audio channel $j$, and $\tau$ represents the temperature coefficient.

[0052] In some embodiments of the present invention, $M$ represents the total number of audio channels and $\tau$ is the temperature coefficient. When $\tau \to 0$, the algorithm selects only the single channel with the best quality. When $\tau \to \infty$, the weight distribution tends to be uniform and the algorithm tends to average the contributions of all channels. The value of $\tau$ selected in this scheme gives the mechanism soft-attention properties, enabling it to adaptively suppress the contribution of low-quality channels (such as clipped or high-noise channels) and to focus the computation on high-fidelity channels.
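The effect of the temperature coefficient can be illustrated with a temperature-scaled softmax over the per-channel quality scores. The sketch below is a generic softmax written with NumPy; the function name and the example scores are hypothetical, and the temperature actually chosen in this scheme is not reproduced here.

```python
import numpy as np

def spatial_attention_weights(scores, temperature):
    """Temperature-scaled softmax over per-channel quality scores."""
    scores = np.asarray(scores, dtype=float)
    # Subtracting the maximum keeps exp() numerically stable.
    z = (scores - scores.max()) / temperature
    e = np.exp(z)
    return e / e.sum()

# Hypothetical quality scores of four channels for one frame.
example_scores = [0.9, 0.5, 0.1, 0.4]
sharp = spatial_attention_weights(example_scores, 1e-3)    # close to one-hot
flat = spatial_attention_weights(example_scores, 1e3)      # close to uniform
```

A very small temperature behaves like hard selection of the best channel, and a very large temperature behaves like plain averaging, which are the two limiting cases described above.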

[0053] Using the above scheme, the system dynamically determines the contribution of each sensor to the final fused signal based on the real-time perceived quality score of each channel. To avoid the temporal discontinuities that hard switching between channels may introduce, this scheme introduces a spatial attention mechanism that dynamically allocates the fusion weight of each channel.

[0054] In some embodiments of the present invention, in the step of calculating the filtered Mel spectrum signal based on the initial Mel spectrum, a horizontal closing operation is performed on the initial Mel spectrum to obtain a closing operation signal, and a vertical opening operation is performed on the closing operation signal to obtain an opening operation signal. The filtered Mel spectrum signal is then calculated based on the opening operation signal and the initial Mel spectrum.

[0055] As shown in Figure 7, during actual implementation the diver's body movements inevitably introduce instantaneous spike noise, which originates from the impact of water flow caused by the body movements. Considering the nonlinear characteristics of human ear frequency perception, and in order to enhance the recognition of low-frequency breathing characteristics, this scheme maps the STFT amplitude spectrum to the Mel frequency domain. The Mel scale is more in line with the perceptual characteristics of the human ear, and the mapping process is as follows:

$$Y_{t}(b) = \sum_{f=1}^{F} W_{\text{mel}}(b, f)\, S_{t}(f)$$

where $Y_{t}(b)$ is the initial Mel spectrum of the $t$-th fused frame signal in the $b$-th Mel frequency band, $S_{t}(f)$ is the first STFT amplitude spectrum at the $f$-th linear frequency, $W_{\text{mel}}$ is the Mel filter bank matrix, $W_{\text{mel}}(b, f)$ represents the mapping weight from the $f$-th linear frequency to the $b$-th Mel frequency band, and $F$ represents the number of effective frequency points after the STFT.
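The Mel mapping described above is a matrix product between a filter bank and the STFT amplitude spectrum. The sketch below builds a triangular filter bank using the common conversion $\text{mel} = 2595 \log_{10}(1 + f/700)$; that conversion, the triangular filter shape, and all function and parameter names are assumptions for illustration, since the original text does not specify how the filter bank is constructed.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sample_rate, n_fft, n_mels):
    """Triangular filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    # Band edges are equally spaced on the Mel scale.
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_mels + 2))
    bank = np.zeros((n_mels, fft_freqs.size))
    for b in range(n_mels):
        left, center, right = edges[b], edges[b + 1], edges[b + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[b] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def mel_spectrum(stft_magnitude, bank):
    """Map an STFT magnitude of shape (n_bins, n_frames) to (n_mels, n_frames)."""
    return bank @ stft_magnitude
```

For example, `mel_filterbank(48000, 1024, 40)` produces a 40 x 513 weight matrix; the values 1024 and 40 are placeholders, not parameters from the original text.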

[0056] In some embodiments of the present invention, in the step of performing a horizontal closing operation on the initial Mel spectrum to obtain a closing operation signal, performing a vertical opening operation on the closing operation signal to obtain an opening operation signal, and calculating the filtered Mel spectrum signal based on the opening operation signal and the initial Mel spectrum, the closing operation signal is calculated using the following formula:

$$Y^{\text{close}}_{t}(b) = \left( Y_{t}(b)^{\top} \oplus B_{h} \right) \ominus B_{h}$$

where $Y^{\text{close}}_{t}(b)$ denotes the closing operation signal, in the $b$-th Mel frequency band, of the fused frame signal corresponding to the $t$-th data frame signal; $Y_{t}(b)^{\top}$ denotes the transpose of the initial Mel spectrum of that fused frame signal in the $b$-th Mel frequency band; $B_{h}$ represents a preset horizontal structuring element; $\oplus$ represents the dilation operation; and $\ominus$ represents the erosion operation. To match the smoothness and temporal continuity of the horizontal bands in breath sounds, the horizontal structuring element $B_{h}$ is an elliptical structuring element of preset size that is longer in the horizontal (time) dimension. Because the elliptical edges are smoother, the closing operation can fill horizontal gaps without introducing overly sharp artificial features, while also eroding vertical line noise to some extent. The dilation operation $\oplus$ performs a local maximum operation on the image with the structuring element: the center of the structuring element traverses each pixel of the image, and the maximum value within the area covered by the structuring element is taken as the new value of the center pixel, which effectively fills the broken parts of a connected target area. The erosion operation $\ominus$ performs a local minimum operation on the image with the structuring element: the center of the structuring element traverses each pixel of the image, and the minimum value within the area covered by the structuring element is taken as the new value of the center pixel, which effectively shrinks the target area. In this process, the erosion suppresses narrow-band transient impacts in the vertical direction, while the dilation restores the continuity of the breathing harmonics in the horizontal direction and thereby fills the transverse gaps.

The opening operation signal is calculated using the following formula:

$$Y^{\text{open}}_{t}(b) = \left( Y^{\text{close}}_{t}(b) \ominus B_{v} \right) \oplus B_{v}$$

where $Y^{\text{open}}_{t}(b)$ denotes the opening operation signal, in the $b$-th Mel frequency band, of the fused frame signal corresponding to the $t$-th data frame signal, and $B_{v}$ represents a preset vertical structuring element. For vertical line noise, the vertical structuring element $B_{v}$ is an elliptical structuring element of preset size that is taller in the vertical (frequency) dimension. Compared with rectangular structuring elements, elliptical structuring elements in the vertical direction avoid excessive clipping of the effective signal. The vertical opening operation first applies erosion and then restores the shape of the remaining structures through dilation; this operation aims to connect broken noise bands in the vertical direction, strengthening the overall connectivity of the impulse noise.

After the initial filtering is completed, to avoid signal amplitude attenuation, the filtering result is normalized to the dynamic range of the original Mel spectrum. The filtered Mel spectrum signal is calculated using the following formula:

$$\tilde{Y}_{t}(b) = Y^{\text{open}}_{t}(b)^{\top} \cdot \frac{\max\left( Y_{t}(b) \right)}{\max\left( Y^{\text{open}}_{t}(b) \right)}$$

where $\tilde{Y}_{t}(b)$ denotes the filtered Mel spectrum signal, in the $b$-th Mel frequency band, of the fused frame signal corresponding to the $t$-th data frame signal; $Y^{\text{open}}_{t}(b)^{\top}$ denotes the transpose of the opening operation signal; $\max\left( Y_{t}(b) \right)$ denotes the maximum value in the initial Mel spectrum; and $\max\left( Y^{\text{open}}_{t}(b) \right)$ denotes the maximum value among the opening operation signals.

[0057] Based on the geometric difference that the effective signal exhibits horizontal stripe characteristics in the Mel spectrum while the noise exhibits a vertical continuous structure, this scheme uses horizontal and vertical structuring elements (SEs) for morphological operations to achieve preliminary noise suppression.
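The horizontal closing, vertical opening, and range normalization described above can be sketched with plain NumPy. In this sketch dilation is a local maximum and erosion is a local minimum under a boolean elliptical footprint; the helper names, the footprint sizes in the usage note, and the border handling (padding with minus or plus infinity) are assumptions, not values from the original text.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def ellipse(height, width):
    """Boolean elliptical structuring element; height and width must be odd."""
    y, x = np.ogrid[-(height // 2):height // 2 + 1, -(width // 2):width // 2 + 1]
    return (y / (height / 2.0)) ** 2 + (x / (width / 2.0)) ** 2 <= 1.0

def _morph(image, footprint, reduce_fn, pad_value):
    image = np.asarray(image, dtype=float)
    ph, pw = footprint.shape[0] // 2, footprint.shape[1] // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)), constant_values=pad_value)
    windows = sliding_window_view(padded, footprint.shape)
    # Keep only the pixels covered by the footprint, then reduce them.
    return reduce_fn(windows[..., footprint], axis=-1)

def dilate(image, footprint):
    return _morph(image, footprint, np.max, -np.inf)   # local maximum

def erode(image, footprint):
    return _morph(image, footprint, np.min, np.inf)    # local minimum

def filter_mel(mel, se_horizontal, se_vertical):
    """Rows are Mel bands (frequency), columns are frames (time)."""
    closed = erode(dilate(mel, se_horizontal), se_horizontal)   # closing
    opened = dilate(erode(closed, se_vertical), se_vertical)    # opening
    peak = opened.max()
    # Rescale to the dynamic range of the input Mel spectrum.
    return opened * (np.max(mel) / peak) if peak > 0 else opened
```

With hypothetical footprints such as `ellipse(1, 5)` and `ellipse(5, 1)`, the closing step fills short gaps along the time axis, and the output of `filter_mel` retains structures that are tall along the frequency axis, that is, the impulse-like components that the later region labeling is meant to find.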

[0058] In some embodiments of the present invention, in the step of calculating spatial attention weights based on the perceived quality scores of data frame signals of the same frame from each audio channel, and fusing the data frame signals of the same frame from each audio channel based on the spatial attention weights to obtain a fused frame signal, the optimal phase lag value for each audio channel is calculated based on the data frame signals of the reference anchor channel, and the fused frame signal is calculated based on the optimal phase lag value and the spatial attention weights.

[0059] In some embodiments of the present invention, in the step of calculating the optimal phase lag value for each audio channel based on the data frame signal of the reference anchor channel, the optimal phase lag value is calculated using the following formula:

$$\hat{d}_{m,t} = \arg\max_{d} \frac{\sum_{n} x_{r,t}(n)\, x_{m,t}(n+d)}{\sqrt{\sum_{n} x_{r,t}(n)^{2}\, \sum_{n} x_{m,t}(n+d)^{2}}}$$

where $\hat{d}_{m,t}$ denotes the optimal phase lag value of the $t$-th data frame signal of audio channel $m$, $x_{r,t}(n)$ denotes the amplitude value at sampling point $n$ of the $t$-th data frame signal of the reference anchor channel $r$, $x_{m,t}(n+d)$ denotes the amplitude value of the $t$-th data frame signal of audio channel $m$ at sampling point $n+d$, and $d$ denotes the preset (candidate) phase lag value.

[0060] Specifically, for the subsequent phase alignment, this scheme selects the audio channel with the highest spatial attention weight as the reference anchor channel, and the maximum is searched over the candidate phase lag values within a preset range.

[0061] In the above scheme, because the microphone array has a distributed layout, the sound waves arrive at sensors in different locations with a slight time difference. If the signals from the channels were weighted and superimposed directly, waveforms with different phases would weaken the effective energy of the breathing sound and destroy its key high-frequency details, causing phase cancellation and amplitude-frequency response distortion of the breathing characteristics. In order to eliminate the phase difference caused by spatial distribution and to make full use of the redundant information of multiple channels to repair local distortion, this scheme performs phase alignment and coherent fusion. The algorithm computes the normalized cross-correlation between each channel and the reference anchor channel to determine the optimal phase lag.
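A lag search based on normalized cross-correlation, as described above, might look like the following sketch. The sign convention (the reference at sample `n` is compared with the channel at sample `n + d`), the symmetric search range `max_lag`, and the function name are assumptions for illustration.

```python
import numpy as np

def best_lag(reference, channel, max_lag):
    """Lag d in [-max_lag, max_lag] that maximizes the normalized
    cross-correlation between reference[n] and channel[n + d]."""
    reference = np.asarray(reference, dtype=float)
    channel = np.asarray(channel, dtype=float)
    n = reference.size
    best, best_score = 0, -np.inf
    for d in range(-max_lag, max_lag + 1):
        lo, hi = max(0, -d), min(n, n - d)       # keep 0 <= n + d < N
        a, b = reference[lo:hi], channel[lo + d:hi + d]
        denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
        score = np.dot(a, b) / denom if denom > 0 else -np.inf
        if score > best_score:
            best, best_score = d, score
    return best
```

Under this convention, a channel that receives the same sound three samples later than the reference yields `best_lag(...) == 3`.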

[0062] In some embodiments of the present invention, in the step of calculating the fused frame signal based on the optimal phase lag value and the spatial attention weight, the amplitude value of each sampling point in the fused frame signal is calculated based on the optimal phase lag value and the spatial attention weight, and the amplitude values of the sampling points are combined to obtain the fused frame signal. The amplitude value of each sampling point in the fused frame signal is calculated using the following formula:

$$y_{t}(n) = \sum_{m=1}^{M} w_{m,t}\, x_{m,t}\left( n + \hat{d}_{m,t} \right)$$

where $y_{t}(n)$ denotes the amplitude value at sampling point $n$ of the $t$-th frame in the fused frame signal, $M$ denotes the number of audio channels, $w_{m,t}$ denotes the spatial attention weight of the $t$-th data frame signal of audio channel $m$, and $x_{m,t}\left( n + \hat{d}_{m,t} \right)$ denotes the amplitude value of the $t$-th data frame signal of audio channel $m$ at the sampling point shifted by its optimal phase lag value $\hat{d}_{m,t}$.

[0063] Using the above scheme, time-domain shift compensation is applied to each non-reference channel based on the estimated optimal phase lag, so that each channel is phase-consistent with the signal of the reference anchor channel. Finally, the signal is reconstructed through aligned, weighted linear superposition. Through this alignment and fusion process, the fused frame signal achieves effective complementarity of the multi-channel information. The waveform characteristics of the respiratory signal are coherently superimposed in the time domain, which significantly enhances the amplitude of the target signal. The final output signal avoids the signal distortion and signal-to-noise ratio fluctuation caused by physical obstruction or sound pressure overload of a single sensor, thereby reconstructing a respiratory sound wave with high dynamic range and high signal-to-noise ratio and providing a clear and reliable data foundation for the subsequent generative repair module.
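The shift compensation and weighted superposition described above can be sketched as follows. The sketch assumes that each lag is an integer number of samples, that samples shifted in from outside the frame are filled with zeros, and that a positive lag means the channel receives the sound later than the reference; these conventions and the function names are illustrative assumptions.

```python
import numpy as np

def align(frame, lag):
    """Return aligned[n] = frame[n + lag], zero-filled at the frame border."""
    frame = np.asarray(frame, dtype=float)
    out = np.zeros_like(frame)
    n = frame.size
    if abs(lag) >= n:
        return out
    if lag >= 0:
        out[:n - lag] = frame[lag:]
    else:
        out[-lag:] = frame[:n + lag]
    return out

def fuse_frame(frames, weights, lags):
    """Weighted sum of lag-compensated channel frames (frames: M x N)."""
    frames = np.asarray(frames, dtype=float)
    fused = np.zeros(frames.shape[1])
    for frame, weight, lag in zip(frames, weights, lags):
        fused += weight * align(frame, lag)
    return fused
```

When the channels carry the same waveform with different delays, the aligned sum reproduces that waveform away from the frame border instead of partially cancelling it.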

[0064] In some embodiments of the present invention, the method further includes model pre-training, wherein adversarial loss, numerical precision loss and frequency domain consistency loss are calculated respectively, and the sum of adversarial loss, numerical precision loss and frequency domain consistency loss is calculated as the total loss, and the generative adversarial network is trained based on the total loss.

[0065] As shown in Figure 8, the model employs a generative adversarial network (GAN) architecture. The generator is based on the U-Net structure and incorporates a temporal attention mechanism to capture long-term patterns in breathing. The discriminator uses a multi-scale convolutional network to enhance the realism of waveform details. During training, this scheme uses the masked clean audio waveform as the model input. Unlike the soft masks used in traditional speech enhancement to suppress noise, the masking concept proposed in this scheme originates from image completion in computer vision and masked language models in natural language processing. This scheme treats severely contaminated data segments as missing information and sets them to zero, forcing the model to abandon its reliance on local noise features and instead use uncontaminated global contextual information to reconstruct the signal.

[0066] The mask sequence $m(n)$ divides the waveform into reserved regions ($m(n) = 1$) and noise regions ($m(n) = 0$). For an input segment of length $L$, this scheme masks the middle region of the sequence, and segments of fixed length at the beginning and at the end are kept intact to provide context, because such time intervals are also preserved during region labeling and noise interval merging. To better adapt the mask to real-world scenarios, this scheme employs the following generation strategy. The length $\ell$ of a single mask segment is sampled from a log-normal distribution, whose probability density function (PDF) is defined as:

$$p(\ell) = \frac{1}{\ell\, \sigma \sqrt{2\pi}} \exp\left( -\frac{(\ln \ell - \mu)^{2}}{2\sigma^{2}} \right)$$

where $p(\ell)$ represents the probability density of a mask segment of length $\ell$ occurring, $\mu$ is the location parameter corresponding to the average segment length and is determined from statistical data, and $\sigma$ is the shape parameter of the distribution. This distribution concentrates more than 70% of the segment lengths in the range of 0.2 to 0.6 seconds, while retaining longer segments in the range of 0.8 to 1.0 seconds, making it more consistent with the non-uniform characteristics of noise segments in real-world scenarios.

[0067] The spacing $g_{i}$ between adjacent mask segments is defined as the time interval between the end sampling point of the previous mask segment and the start sampling point of the next mask segment, and it is sampled from a uniform distribution: $g_{i} \sim \mathcal{U}\{g_{\min}, g_{\max}\}$, that is, an integer is randomly sampled from the interval $[g_{\min}, g_{\max}]$ as $g_{i}$, which avoids excessively dense or excessively sparse segments. Based on the above sampling rules for the mask segment length $\ell_{i}$ and the spacing $g_{i}$, a set of mask intervals $\{[s_{i}, e_{i}]\}_{i=1}^{I}$ is generated first, where $I$ is the total number of mask segments and the first segment begins after the preserved leading context. The start and end sampling points of the $i$-th mask segment ($i \ge 2$) are:

$$s_{i} = e_{i-1} + g_{i-1}, \qquad e_{i} = s_{i} + \ell_{i}$$

Based on the mask interval set, the mask sequence $m(n)$ is defined pointwise as:

$$m(n) = \begin{cases} 0, & \exists\, i:\ s_{i} \le n \le e_{i} \\ 1, & \text{otherwise} \end{cases}$$

where $n$ is the time-domain sampling point index, and $s_{i}$ and $e_{i}$ denote the start and end sampling points of the $i$-th masked segment. The masked audio sequence is the element-wise product of the original normalized audio waveform $x$ and the mask sequence: $\tilde{x} = x \odot m$, where $\odot$ represents the element-wise product. The mask sequence generated by the above rules has a non-uniform distribution of segment lengths that is closer to the randomness of real long-term noise regions. This effectively prevents the model from learning local interpolation shortcuts and forces it to extract global temporal features.
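The mask generation strategy described above (log-normal segment lengths, uniformly sampled integer gaps, untouched context at both ends) can be sketched as follows. Every numeric parameter in the usage note is a placeholder, since the values used in the original text are not available; the function name and the half-open interval convention `[start, end)` are also assumptions.

```python
import numpy as np

def generate_mask(n_samples, sample_rate, context, mu, sigma, gap_range, rng):
    """Return (mask, intervals); mask is 1 in kept regions and 0 in masked ones.

    context   : samples kept intact at the beginning and at the end
    mu, sigma : log-normal parameters of the segment length in seconds
    gap_range : (min, max) gap between segments in samples, inclusive
    """
    mask = np.ones(n_samples)
    intervals = []
    start = context
    while True:
        length = max(1, int(round(rng.lognormal(mu, sigma) * sample_rate)))
        end = start + length                      # exclusive end index
        if end > n_samples - context:
            break
        mask[start:end] = 0.0
        intervals.append((start, end))
        start = end + int(rng.integers(gap_range[0], gap_range[1] + 1))
    return mask, intervals
```

A call such as `generate_mask(10000, 1000, 1000, np.log(0.4), 0.5, (200, 800), np.random.default_rng(0))` returns a binary mask that can be multiplied element-wise with a waveform of the same length to obtain the masked training input.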

[0068] The generator employs an encoder-decoder architecture and introduces a bottleneck module specifically designed for long-time sequence completion. The encoder consists of four downsampling blocks, with the number of channels increasing layer by layer to extract high-level acoustic features. The bottleneck layer introduces a depthwise separable dilated convolution mechanism. The convolution operation is decoupled into two steps: first, each channel is independently filtered along the time axis by a depthwise convolution whose dilation rate grows exponentially, which greatly expands the receptive field; subsequently, $1 \times 1$ pointwise convolutions fuse information between channels. Furthermore, this scheme introduces a temporal attention module between convolutional blocks to calculate the autocorrelation of the feature sequence:

$$\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{Q K^{\top}}{\sqrt{d}} \right) V$$

where $Q$, $K$ and $V$ are generated by convolutional layers and $d$ represents the number of channels. This mechanism enables the model to focus on long-range contextual information before and after the masked area.
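The temporal attention module described above can be illustrated with a scaled dot-product attention over time steps. This is a generic NumPy sketch under the assumption that the module follows the standard query/key/value formulation with scaling by the square root of the channel count; in the actual model the three inputs would come from convolutional layers, which are not reproduced here, and the function name is hypothetical.

```python
import numpy as np

def temporal_attention(q, k, v):
    """Scaled dot-product attention over the time axis.

    q, k : arrays of shape (time, channels)
    v    : array of shape (time, value_channels)
    Returns (output, weights), where each row of weights sums to 1.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

Because every output time step is a weighted average over all time steps, features located before and after a masked region can contribute to its reconstruction.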

[0069] The decoder consists of four upsampling blocks that progressively restore the temporal resolution of the waveform through interpolation and convolution. The decoder introduces skip connections, concatenating shallow features from the encoder with features from the decoder to preserve the waveform's detailed texture.

[0070] To address the overly smooth reconstructed waveforms and the loss of high-frequency details caused by traditional loss mechanisms, this scheme introduces a discriminator that enhances the realism of the reconstructed signal through adversarial training. Given the high sampling rate of audio signals, this scheme designs a discriminative network with a large receptive field based on grouped convolution. The network's input layer maps single-channel audio to 32-channel features. Subsequently, three large-kernel convolutional layers are stacked for downsampling. This design rapidly reduces the temporal dimension while capturing long-term acoustic structures. To reduce computational complexity, a grouped convolution strategy is employed, with 4, 16 and 64 groups respectively, expanding the number of channels to 2048. Finally, discriminative scores are output through 5×1 and 3×1 convolutional layers. For the adversarial loss function, a hinge loss is used as the discriminator loss to stabilize the training process:

$$L_{D} = \mathbb{E}_{x \sim p_{\text{data}}}\left[ \text{ReLU}\left( 1 - D(x) \right) \right] + \mathbb{E}_{\hat{x} \sim p_{G}}\left[ \text{ReLU}\left( 1 + D(\hat{x}) \right) \right]$$

where $x$ is a raw breathing audio sample from the real dataset, $\hat{x}$ is an audio sample reconstructed by the generator, $D(x)$ is the score (discrimination score) that the discriminator gives to the real sample $x$, and $D(\hat{x})$ is the score that the discriminator gives to the generated sample $\hat{x}$. $\text{ReLU}$ is the rectified linear unit function; it is used to ignore samples that the discriminator has already classified correctly and confidently (that is, real samples with scores above the boundary of 1 or generated samples with scores below -1), so that training focuses on samples that are difficult to distinguish. $\mathbb{E}$ denotes the operation of calculating the mathematical expectation. Taking $\mathbb{E}_{\hat{x} \sim p_{G}}\left[ D(\hat{x}) \right]$ as an example, it means that the average of the discriminator output $D(\hat{x})$ is calculated over all generated samples $\hat{x}$ drawn from the generator distribution $p_{G}$; simply put, it is the average score of the generated samples on the discriminator. A high score indicates that the discriminator considers the sample to be real.
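The margin behavior described above (samples scored beyond the boundaries of 1 and -1 no longer contribute) is the defining property of a hinge loss. The following NumPy sketch computes such a discriminator loss from arrays of scores; the function names are hypothetical, and the sketch operates on plain score arrays rather than on a trained network.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def discriminator_hinge_loss(real_scores, fake_scores):
    """Hinge loss: real samples are pushed above +1, generated ones below -1."""
    real_scores = np.asarray(real_scores, dtype=float)
    fake_scores = np.asarray(fake_scores, dtype=float)
    return float(np.mean(relu(1.0 - real_scores)) + np.mean(relu(1.0 + fake_scores)))
```

For example, real scores of 2.0 and 1.5 together with generated scores of -3.0 and -1.2 give a loss of exactly 0, because every sample is already on the correct side of its margin.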

[0071] The adversarial loss of the generator aims to deceive the discriminator into giving high scores to the generated samples:

$$L_{\text{adv}} = -\mathbb{E}_{\hat{x} \sim p_{G}}\left[ D(\hat{x}) \right]$$

The model's total loss is a weighted sum of three parts, constraining the model in three respects: numerical accuracy, frequency domain consistency, and adversarial realism. The L1 loss measures the numerical error in the masked region:

$$L_{1} = \left\| (1 - m) \odot (\hat{x} - x) \right\|_{1}$$

where $x$ is the real waveform, $\hat{x}$ is the reconstructed waveform, and $m$ represents the mask sequence obtained above; to unify units, its coordinates are mapped to the time axis. The penalty of the L1 loss grows linearly with the error, so compared with the mean squared error (MSE) loss it is more tolerant of spike signals and avoids overly smooth reconstruction results. In addition, the spectral convergence loss and the logarithmic magnitude loss are calculated for several different FFT window sizes to ensure consistency in both the time domain and the frequency domain.

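[0072] For a window size $w$, the two frequency domain terms are:

$$L_{\text{sc}}^{(w)} = \frac{\left\| \, |\text{STFT}_{w}(x)| - |\text{STFT}_{w}(\hat{x})| \, \right\|_{F}}{\left\| \, |\text{STFT}_{w}(x)| \, \right\|_{F}}, \qquad L_{\text{mag}}^{(w)} = \frac{1}{N_{w}} \left\| \log |\text{STFT}_{w}(x)| - \log |\text{STFT}_{w}(\hat{x})| \right\|_{1}$$

where $\text{STFT}_{w}(\cdot)$ indicates the short-time Fourier transform performed with window size $w$, $|\cdot|$ represents the amplitude spectrum of the complex spectrum, $x$ and $\hat{x}$ are the real and reconstructed waveforms, and $N_{w}$ is the number of time-frequency points. The Frobenius-norm term $L_{\text{sc}}^{(w)}$ is used to calculate the spectral convergence loss, emphasizing the fit to high-energy regions (such as resonance peaks). The L1 term $L_{\text{mag}}^{(w)}$ is used to calculate the logarithmic magnitude loss. Since the logarithmic operation narrows the energy differences, this term encourages the model to focus on the recovery of details in low-energy regions.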
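The spectral convergence and logarithmic magnitude terms described above can be sketched with a simple framed FFT. The Hann window, the hop of one quarter of the window, the small `eps` inside the logarithm, the averaging over resolutions, and the three window sizes are all assumptions for illustration; the window sizes used in this scheme are not reproduced here.

```python
import numpy as np

def stft_magnitude(x, window_size, hop):
    """Magnitude spectrogram, shape (n_frames, window_size // 2 + 1)."""
    x = np.asarray(x, dtype=float)
    window = np.hanning(window_size)
    n_frames = 1 + (len(x) - window_size) // hop
    frames = np.stack([x[i * hop:i * hop + window_size] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def spectral_convergence(target, estimate, window_size, hop):
    t = stft_magnitude(target, window_size, hop)
    e = stft_magnitude(estimate, window_size, hop)
    return float(np.linalg.norm(t - e) / np.linalg.norm(t))   # Frobenius norms

def log_magnitude_loss(target, estimate, window_size, hop, eps=1e-7):
    t = stft_magnitude(target, window_size, hop)
    e = stft_magnitude(estimate, window_size, hop)
    return float(np.mean(np.abs(np.log(t + eps) - np.log(e + eps))))

def multi_resolution_stft_loss(target, estimate, window_sizes=(256, 512, 1024)):
    """Average of both terms over several (hypothetical) window sizes."""
    losses = [spectral_convergence(target, estimate, w, w // 4)
              + log_magnitude_loss(target, estimate, w, w // 4)
              for w in window_sizes]
    return float(np.mean(losses))
```

An estimate that equals the target gives a loss of 0, and scaling the estimate by 0.5 gives a spectral convergence of 0.5 at every resolution, which makes the normalization in that term easy to check.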

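[0073] The final total loss is the weighted sum of the losses mentioned above, defined as:

$$L_{\text{total}} = \lambda_{\text{adv}} L_{\text{adv}} + \lambda_{1} L_{1} + \lambda_{\text{freq}} L_{\text{freq}}$$

where $L_{\text{adv}}$ is the adversarial loss, $L_{1}$ is the numerical precision loss, and $L_{\text{freq}}$ is the frequency domain consistency loss. In this scheme, the hyperparameters $\lambda_{\text{adv}}$, $\lambda_{1}$ and $\lambda_{\text{freq}}$ are set through grid search and empirical tuning.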

[0074] In the specific implementation process, in the step of determining the respiratory monitoring results based on the generative adversarial network, the generator of the generative adversarial network outputs a reconstructed waveform, and the time difference $\Delta T$ (in seconds) between the start times of two adjacent inspiratory cycles is calculated to obtain the instantaneous respiratory rate (RR), expressed in breaths per minute (bpm): $RR = 60 / \Delta T$. By matching adjacent inspiratory and expiratory phases, the algorithm calculates the inhalation duration $T_{I}$ and the exhalation duration $T_{E}$ within a single respiratory cycle. Their ratio, the inspiratory-to-expiratory ratio (I:E ratio), is a core parameter for assessing a diver's breathing pattern: $I{:}E = T_{I} / T_{E}$.
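The two monitoring quantities described above reduce to simple arithmetic on event times. The sketch below assumes that inspiration onset times and phase durations are already available in seconds; the function names are hypothetical.

```python
def instantaneous_rates_bpm(inspiration_onsets_s):
    """Respiratory rate in breaths per minute for each pair of adjacent onsets."""
    return [60.0 / (later - earlier)
            for earlier, later in zip(inspiration_onsets_s, inspiration_onsets_s[1:])]

def ie_ratio(inhale_duration_s, exhale_duration_s):
    """Inspiratory-to-expiratory time ratio of one respiratory cycle."""
    return inhale_duration_s / exhale_duration_s
```

For instance, onsets at 0 s, 4 s and 9 s correspond to instantaneous rates of 15 and 12 breaths per minute, and an inhalation of 1.5 s followed by an exhalation of 3.0 s gives an I:E ratio of 0.5.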

[0075] Specifically, breath-holding is characterized by a prolonged absence of respiratory signals. This scheme sets a breath-holding duration threshold $T_{\text{hold}}$. The algorithm continuously monitors the time $t_{\text{last}}$ of the most recent respiratory peak; if the time elapsed since $t_{\text{last}}$ exceeds $T_{\text{hold}}$, a breath-holding alarm is triggered. The breath-holding time is the interval between the peaks of two respiratory cycle waveforms.

[0076] Rapid breathing detection: Hyperventilation is typically characterized by an abnormally high respiratory rate. To prevent false alarms triggered by brief fluctuations in respiratory rate due to movement, this scheme employs a dual-condition criterion based on a sliding window:

$$\overline{RR} > RR_{\text{th}} \quad \text{and} \quad \sigma_{RR} < \sigma_{\text{th}}$$

That is, when the average respiratory rate $\overline{RR}$ within the window exceeds the warning threshold $RR_{\text{th}}$ and the respiratory rhythm remains relatively stable (the standard deviation of the respiratory rate $\sigma_{RR}$ is less than the preset stability threshold $\sigma_{\text{th}}$), sudden events such as coughing or swallowing are excluded, the state is determined to be hyperventilation, and an alarm is issued.
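The dual-condition rule described above can be sketched with a fixed-length sliding window over recent respiratory rate values. The window length, both thresholds, the use of the population standard deviation, and the class name are placeholders chosen for illustration; the values used in this scheme are not reproduced here.

```python
from collections import deque
from statistics import mean, pstdev

class HyperventilationDetector:
    """Alarm when the windowed mean rate is high AND the rate is stable."""

    def __init__(self, window, rate_threshold_bpm, std_threshold_bpm):
        self.rates = deque(maxlen=window)
        self.rate_threshold_bpm = rate_threshold_bpm
        self.std_threshold_bpm = std_threshold_bpm

    def update(self, rate_bpm):
        """Add one rate value; return True if the alarm condition holds."""
        self.rates.append(rate_bpm)
        if len(self.rates) < self.rates.maxlen:
            return False                      # window not yet full
        return (mean(self.rates) > self.rate_threshold_bpm
                and pstdev(self.rates) < self.std_threshold_bpm)
```

With a window of four values and hypothetical thresholds of 25 bpm and 2 bpm, a steady sequence around 30 bpm triggers the alarm, whereas a sequence alternating between 12 and 60 bpm does not, because its standard deviation is too large even though its mean is high.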

[0077] To overcome the limitations of a single energy index over a wide dynamic range, this scheme utilizes the high sensitivity of Mel frequency cepstral coefficients (MFCC) to acoustic textures to capture subtle breathing initiation points in the reconstructed waveform.

[0078] (1) Envelope construction: First, MFCC features are extracted from the preprocessed signal. Compared with time-domain amplitude, cepstral features are more sensitive to the broadband turbulence textures generated by airflow. The 0th-order coefficient (DC component) of the MFCC is selected as the initial respiratory energy envelope. This coefficient characterizes the full-band energy distribution in the logarithmic domain and can effectively resist the linear amplitude attenuation caused by changes in water pressure.

[0079] (2) Savitzky-Golay filtering: The raw envelope often contains high-frequency jitter and spurious peaks. Traditional smoothing filters (such as mean filters), while suppressing noise, inevitably blur the peak and valley boundaries of the waveform. To accurately locate the start and end points of breathing, this scheme applies a Savitzky-Golay (SG) filter to the raw envelope. The SG filter uses local least squares to fit a polynomial to the signal within a sliding window. Let the filter window length be $W$ and the order of the fitted polynomial be $p$. For the center point $n$, the filtered envelope $\tilde{e}(n)$ is defined as:

$$\tilde{e}(n) = \sum_{i=-(W-1)/2}^{(W-1)/2} c_{i}\, e(n+i)$$

where $e(n)$ is the raw envelope and $c_{i}$ are the pre-calculated convolution coefficients. Unlike simple smoothing methods such as mean filtering, which tend to suppress all high-frequency components, the SG filter preserves the transient trend of the signal through local polynomial fitting. The breathing envelope has a pronounced rising-falling edge transition in the inhalation-exhalation transition region. This characteristic allows the algorithm to filter out high-frequency jitter while largely avoiding signal distortion, preserving the prominence of peaks and the sharpness of troughs.
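The pre-calculated convolution coefficients mentioned above can be derived from a least-squares polynomial fit. The sketch below obtains them from the pseudo-inverse of a Vandermonde matrix and applies them with edge padding; the padding mode, the function names, and the window and order values in the usage note are assumptions for illustration.

```python
import numpy as np

def savgol_coefficients(window_length, poly_order):
    """Smoothing coefficients for an odd window length."""
    half = window_length // 2
    offsets = np.arange(-half, half + 1, dtype=float)
    # Columns are offsets**0, offsets**1, ..., offsets**poly_order.
    vander = np.vander(offsets, poly_order + 1, increasing=True)
    # Row 0 of the pseudo-inverse maps window samples to the value of the
    # fitted polynomial at the window center.
    return np.linalg.pinv(vander)[0]

def savgol_smooth(envelope, window_length, poly_order):
    coeffs = savgol_coefficients(window_length, poly_order)
    half = window_length // 2
    padded = np.pad(np.asarray(envelope, dtype=float), half, mode="edge")
    # Reversing the kernel turns np.convolve into a correlation.
    return np.convolve(padded, coeffs[::-1], mode="valid")
```

With a window of 5 and order 2, the coefficients are the well-known values (-3, 12, 17, 12, -3) / 35. A signal that is itself a polynomial of degree at most the fitted order passes through unchanged away from the borders, which is why peaks and troughs are distorted less than with a moving average.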

[0080] (3) Boundary positioning: Based on the filtered envelope, the algorithm executes a trough detection strategy. The transition point of respiratory airflow (that is, the moment of alternation between inhalation and exhalation) corresponds to a minimum point of the envelope. A peak search algorithm is then applied to the filtered envelope to find its extreme values and determine the set of start and end boundaries of candidate respiratory events.
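Trough detection as described above amounts to finding local minima of the smoothed envelope and using consecutive minima as segment boundaries. The sketch below uses a plain neighbor comparison; the function names and the tie-breaking rule for flat troughs are assumptions, and a practical system would likely add prominence or distance constraints that are not described in the original text.

```python
import numpy as np

def find_troughs(envelope):
    """Indices of local minima (strictly below the left neighbor,
    not above the right neighbor)."""
    e = np.asarray(envelope, dtype=float)
    inner = (e[1:-1] < e[:-2]) & (e[1:-1] <= e[2:])
    return (np.where(inner)[0] + 1).tolist()

def candidate_segments(envelope):
    """Pairs of consecutive troughs, used as (start, end) boundaries."""
    troughs = find_troughs(envelope)
    return list(zip(troughs, troughs[1:]))
```

For the toy envelope `[3, 1, 2, 5, 2, 0, 4]`, the troughs are at indices 1 and 5, so a single candidate segment spans samples 1 to 5.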

[0081] (4) Phase discrimination: While SG filtering accurately pinpoints the temporal boundaries of breathing events, it does not provide category labels for inhalation or exhalation. It is worth noting that relying solely on time-domain energy (amplitude) for phase division is unreliable in underwater environments. The additional water pressure differences caused by changes in diver attitude lead to drastic fluctuations in breathing energy within a short period, while anxiety-induced shallow and rapid breathing manifests as extremely weak pulses. This large dynamic range makes traditional fixed-amplitude threshold algorithms prone to failure.

[0082] In contrast, the frequency domain distribution, which is based on physical acoustic properties, is highly stable. Regardless of energy intensity, the high-speed turbulence generated by the inspiratory airflow through the airflow path is always accompanied by a significant high-frequency component, while the exhalation airflow sound is mainly concentrated in the mid-to-low frequencies. Therefore, the algorithm calculates the spectral centroid of each candidate segment as a discriminant feature for phase classification:

$$SC = \frac{\sum_{k} f_{k}\, A_{k}}{\sum_{k} A_{k}}$$

where $f_{k}$ are the frequency values and $A_{k}$ are the corresponding spectral amplitudes. A centroid threshold $SC_{\text{th}}$ is set: if $SC > SC_{\text{th}}$, the segment is determined to be the inhalation phase; otherwise, it is the exhalation phase.
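The centroid-based phase decision described above can be sketched as an amplitude-weighted mean frequency compared against a threshold. The threshold value in the usage note, the labels, and the function names are hypothetical; the rule that a higher centroid indicates inhalation follows the acoustic reasoning in the paragraph above.

```python
import numpy as np

def spectral_centroid(segment, sample_rate):
    """Amplitude-weighted mean frequency of a segment, in Hz."""
    segment = np.asarray(segment, dtype=float)
    magnitude = np.abs(np.fft.rfft(segment))
    freqs = np.fft.rfftfreq(segment.size, d=1.0 / sample_rate)
    total = magnitude.sum()
    return float((freqs * magnitude).sum() / total) if total > 0 else 0.0

def classify_phase(segment, sample_rate, centroid_threshold_hz):
    """Label a segment 'inhale' if its centroid exceeds the threshold."""
    centroid = spectral_centroid(segment, sample_rate)
    return "inhale" if centroid > centroid_threshold_hz else "exhale"
```

With a placeholder threshold of 1000 Hz, a segment dominated by a 2000 Hz component is labeled as inhalation and a segment dominated by a 200 Hz component as exhalation, independent of their amplitudes.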

[0083] (5) Semantic consistency merging: To address the oversegmentation caused by shallow and rapid breathing, this scheme introduces a semantic consistency merging strategy. If the time interval between two adjacent candidate segments is less than a preset threshold and the two segments are classified as the same phase (for example, two consecutive short inspiratory segments), the algorithm determines that they are fragments of the same respiratory event and merges them. This mechanism effectively avoids the statistical errors caused by continuous exhalation or intermittent inhalation.
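The merging rule described above can be sketched as a single pass over time-ordered segments. The tuple layout `(start, end, phase)`, the gap threshold in the usage note, and the function name are assumptions for illustration.

```python
def merge_segments(segments, max_gap):
    """Merge adjacent segments of the same phase separated by a short gap.

    segments : list of (start, end, phase) tuples sorted by start time
    max_gap  : gaps strictly shorter than this are merged
    """
    merged = []
    for start, end, phase in segments:
        if merged and merged[-1][2] == phase and start - merged[-1][1] < max_gap:
            # Extend the previous segment instead of starting a new one.
            merged[-1] = (merged[-1][0], end, phase)
        else:
            merged.append((start, end, phase))
    return merged
```

For example, with a placeholder gap threshold of 0.3 s, two inspiratory fragments separated by 0.1 s become one inhalation, while an exhalation that follows is kept as a separate segment.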

[0084] Experimental example. Hardware preparation: The hardware required to implement this algorithm is a custom microphone array built on a printed circuit board (PCB), with an STM32H743IIT6 as the main control chip and four MEMS microphones that collect sound signals at a sampling rate of 48 kHz. The microphone array is connected to a Raspberry Pi 4B, which serves as the host computer.

[0085] Algorithm implementation: The algorithm was implemented using Python and deployed on a Raspberry Pi 4B to output respiratory rate and respiratory ratio data in real time.

[0086] To quantify the system's monitoring capability in dynamic underwater environments, the experimental example uses the mean absolute error as the core evaluation metric. The respiratory rate error is defined as the absolute difference between the system's estimated breaths per minute (BPM) and the true reference value. Figure 9, Figure 10 and Figure 11 show the system's performance in respiratory rate, respiratory ratio, and anomaly detection. Under varying water depths and motion intensities, the system maintained a low error level (average below 1 BPM), demonstrating its adaptability to changing environments and motion states. The respiratory ratio error is defined as the deviation between the system-detected inspiratory/expiratory time ratio and the actual ratio. Experimental results show that the system can accurately capture the transition points of the respiratory phase, which verifies the high fidelity of the waveform reconstruction.

[0087] For the two types of abnormal breathing events that are critical to diving safety, namely breath-holding and rapid breathing, the true positive rate and the false positive rate are introduced for evaluation. The true positive rate refers to the proportion of abnormal events correctly identified by the system out of the total number of actual abnormal events. The false positive rate refers to the proportion of normal breathing segments incorrectly labeled as abnormal out of all events detected by the system. Experimental results show that the system not only effectively avoids missed detections to ensure safety but also significantly reduces the psychological burden on divers caused by false alarms.

[0088] This invention also provides a diver breathing monitoring system for closed-circuit respirators. The system includes a computer device, which includes a processor and a memory. The memory stores computer instructions, and the processor executes the computer instructions stored in the memory. When the computer instructions are executed by the processor, the system performs the steps described above.

[0089] This invention also provides a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the steps of the aforementioned method for monitoring the breathing of divers using a closed-circuit respirator. The computer-readable storage medium can be a tangible storage medium, such as random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, floppy disks, hard disks, removable storage disks, CD-ROMs, or any other form of storage medium known in the art.

[0090] Those skilled in the art will understand that the exemplary components, systems, and methods described in conjunction with the embodiments disclosed herein can be implemented in hardware, software, or a combination of both. Whether implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this invention. When implemented in hardware, it can be, for example, electronic circuits, application-specific integrated circuits (ASICs), appropriate firmware, plug-ins, function cards, etc. When implemented in software, the elements of this invention are programs or code segments used to perform the desired tasks. The programs or code segments can be stored in a machine-readable medium or transmitted over a transmission medium or communication link via data signals carried in a carrier wave.

[0091] It should be clarified that the present invention is not limited to the specific configurations and processes described above and shown in the figures. For the sake of brevity, detailed descriptions of known methods are omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method process of the present invention is not limited to the specific steps described and shown. Those skilled in the art can make various changes, modifications, and additions, or change the order of steps, after understanding the spirit of the present invention.

[0092] In this invention, features described and/or illustrated for one embodiment may be used in the same or similar manner in one or more other embodiments, and/or combined with or in place of features of other embodiments.

[0093] The above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention. For those skilled in the art, various modifications and variations of the embodiments of the present invention are possible. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.

Claims

1. A method for monitoring the breathing of a diver using a closed-circuit respirator, characterized in that the method comprises the following steps: collecting audio data from the microphones in a microphone array, wherein the audio data of the audio channel corresponding to each microphone comprises multiple data frame signals, and calculating a perceptual quality score for the data frame signals of each audio channel; calculating spatial attention weights based on the perceptual quality scores of the data frame signals of the same frame from each audio channel, and fusing the data frame signals of the same frame from each audio channel based on the spatial attention weights to obtain a fused frame signal; performing a short-time Fourier transform (STFT) on the fused frame signal to obtain a first STFT amplitude spectrum, and mapping the first STFT amplitude spectrum to the Mel frequency domain to obtain an initial Mel spectrum; calculating a filtered Mel spectrum signal based on the initial Mel spectrum, calculating a noise residual based on the filtered Mel spectrum signal and the initial Mel spectrum, constructing a binary mask based on the noise residual, applying the binary mask to the initial Mel spectrum to obtain a zeroed Mel spectrum, mapping the zeroed Mel spectrum back to a second STFT amplitude spectrum by an inverse Mel transform, and performing an inverse short-time Fourier transform on the second STFT amplitude spectrum to construct a time-domain masking signal; and inputting the time-domain masking signal into a preset generative adversarial network (GAN), and determining a respiratory monitoring result based on the GAN.
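The masking step of claim 1 (noise residual, binary mask, zeroed Mel spectrum) might be sketched as below. The claim does not state how the residual is turned into a mask, so the fixed threshold used here is an assumption, as are the function names and the sample values. Spectra are represented as plain nested lists (Mel bands by frames) to keep the sketch dependency-free.

```python
def build_binary_mask(initial_mel, filtered_mel, threshold):
    """Return 1 where a bin is kept and 0 where the noise residual is large.

    The thresholding rule is an assumption; the claim only says the mask
    is constructed based on the noise residual.
    """
    mask = []
    for initial_row, filtered_row in zip(initial_mel, filtered_mel):
        # Noise residual: difference between the initial and filtered spectra.
        residual_row = [m - f for m, f in zip(initial_row, filtered_row)]
        mask.append([0 if abs(r) > threshold else 1 for r in residual_row])
    return mask


def apply_mask(initial_mel, mask):
    """Zero the masked bins of the initial Mel spectrum."""
    return [[m * k for m, k in zip(m_row, k_row)]
            for m_row, k_row in zip(initial_mel, mask)]


# Hypothetical 2-band, 3-frame spectra, for illustration only.
initial = [[1.0, 5.0, 1.2], [0.9, 1.1, 4.0]]
filtered = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
mask = build_binary_mask(initial, filtered, threshold=2.0)
zeroed = apply_mask(initial, mask)
print(mask)
print(zeroed)
```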

2. The method for monitoring diver breathing in a closed-circuit respirator according to claim 1, characterized in that, in the step of calculating a perceptual quality score for the data frame signals of each audio channel: a normalized power spectral density is calculated for each sampling point of each data frame signal; the normalized spectral entropy of the data frame signal is calculated based on the normalized power spectral density at each sampling point of the data frame signal; the amplitude saturation penalty term of the data frame signal is calculated; and the perceptual quality score of the data frame signal of the current audio channel is calculated based on the normalized spectral entropy and the amplitude saturation penalty term.

3. The method for monitoring diver breathing in a closed-circuit respirator according to claim 2, characterized in that, in the step of calculating the normalized spectral entropy of the data frame signal based on the normalized power spectral density at each sampling point of the data frame signal, the normalized spectral entropy is calculated using a formula (not reproduced in this text) whose terms denote: the normalized spectral entropy of a given data frame signal of a given audio channel; the normalized power spectral density at a given sampling point of that data frame signal; and the number of sampling points in each data frame signal of that audio channel.
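Because the formula of claim 3 is not reproduced in this text, the sketch below assumes one common definition that is consistent with the listed terms: the Shannon entropy of the normalized power spectral density, divided by the logarithm of the number of points so that the result lies in [0, 1]. The function names are hypothetical, and a naive DFT is used only to keep the example free of external dependencies.

```python
import cmath
import math


def normalized_psd(frame):
    """Power spectrum of a frame, normalized to sum to 1 (naive DFT)."""
    n = len(frame)
    power = []
    for k in range(n):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                    for i, x in enumerate(frame))
        power.append(abs(coeff) ** 2)
    total = sum(power)
    if total == 0.0:
        return [1.0 / n] * n  # assumption: treat a silent frame as spectrally flat
    return [p / total for p in power]


def normalized_spectral_entropy(psd):
    """Shannon entropy of the normalized PSD, scaled into [0, 1] by log(N)."""
    n = len(psd)
    if n < 2:
        raise ValueError("at least two spectral points are required")
    entropy = -sum(p * math.log(p) for p in psd if p > 0.0)
    return entropy / math.log(n)


# An impulse has a flat spectrum; a pure tone concentrates energy in two bins.
impulse = [1.0] + [0.0] * 7
tone = [math.cos(2 * math.pi * i / 8) for i in range(8)]
h_impulse = normalized_spectral_entropy(normalized_psd(impulse))
h_tone = normalized_spectral_entropy(normalized_psd(tone))
print(h_impulse, h_tone)
```

Under this assumed definition, broadband frames score close to 1 and narrowband frames score close to 0.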

4. The method for monitoring diver breathing in a closed-circuit respirator according to claim 2, characterized in that, in the step of calculating the amplitude saturation penalty term for the data frame signal, the amplitude saturation penalty term is calculated using a formula (not reproduced in this text) whose terms denote: the amplitude saturation penalty term of a given data frame signal of a given audio channel; the amplitude value at a given sampling point of that data frame signal; the preset acoustic overload threshold; and the number of sampling points in each data frame signal of that audio channel.
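The exact penalty formula of claim 4 is not reproduced in this text. As a hedged illustration only, the sketch below assumes the penalty is the fraction of sampling points whose absolute amplitude reaches or exceeds the preset acoustic overload threshold. This uses the same terms the claim lists (amplitude values, threshold, number of sampling points) but may differ from the claimed formula. The function name and values are hypothetical.

```python
def saturation_penalty(frame, overload_threshold):
    """Fraction of samples at or above the overload threshold (assumed form)."""
    if not frame:
        raise ValueError("frame must not be empty")
    saturated = sum(1 for x in frame if abs(x) >= overload_threshold)
    return saturated / len(frame)


# Hypothetical normalized amplitudes, for illustration only.
frame = [0.10, -0.99, 1.00, 0.20]
penalty = saturation_penalty(frame, overload_threshold=0.95)
print(penalty)
```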

5. The method for monitoring diver breathing in a closed-circuit respirator according to claim 2, characterized in that, in the step of calculating the perceptual quality score of the data frame signal of the current audio channel based on the normalized spectral entropy and the amplitude saturation penalty term, the perceptual quality score of the data frame signal is calculated using a formula (not reproduced in this text) whose terms denote: the perceptual quality score of a given data frame signal of a given audio channel; the amplitude saturation penalty term of that data frame signal; the normalized spectral entropy of that data frame signal; and the preset balance coefficients.
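Neither the scoring formula of claim 5 nor the mapping from scores to spatial attention weights is reproduced in this text. The sketch below is therefore built entirely on assumptions: the score is taken as a linear combination in which lower entropy and lower saturation are treated as better, and the weights are obtained with a softmax over the channels of one frame. The names `quality_score`, `attention_weights`, `alpha` and `beta` are hypothetical.

```python
import math


def quality_score(entropy, penalty, alpha, beta):
    """Assumed linear score: alpha * (1 - entropy) - beta * penalty."""
    return alpha * (1.0 - entropy) - beta * penalty


def attention_weights(scores):
    """Softmax over per-channel scores of one frame (assumed weighting rule)."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical per-channel (entropy, penalty) pairs for a single frame.
features = [(0.4, 0.1), (0.7, 0.2), (0.9, 0.6)]
scores = [quality_score(h, p, alpha=0.7, beta=0.3) for h, p in features]
weights = attention_weights(scores)
print(scores, weights)
```

Whatever the actual formulas are, the structure is the same: one score per channel per frame, converted into weights that sum to one across channels.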

6. The method for monitoring diver breathing in a closed-circuit respirator according to claim 1, characterized in that, in the step of calculating the filtered Mel spectrum signal based on the initial Mel spectrum, a horizontal closing operation is performed on the initial Mel spectrum to obtain a closing operation signal, a vertical opening operation is performed on the closing operation signal to obtain an opening operation signal, and the filtered Mel spectrum signal is calculated based on the opening operation signal and the initial Mel spectrum.

7. The method for monitoring diver breathing in a closed-circuit respirator according to claim 6, characterized in that, in the step of performing a horizontal closing operation on the initial Mel spectrum to obtain a closing operation signal, performing a vertical opening operation on the closing operation signal to obtain an opening operation signal, and calculating the filtered Mel spectrum signal based on the opening operation signal and the initial Mel spectrum: the closing operation signal is calculated using a formula (not reproduced in this text) whose terms denote the closing operation signal of a given Mel frequency band of the fused frame signal corresponding to a given data frame signal, the transpose of the initial Mel spectrum of that Mel frequency band, a preset horizontal structuring element, the dilation operation, and the erosion operation; the opening operation signal is calculated using a formula (not reproduced in this text) whose terms denote the opening operation signal of a given Mel frequency band of the fused frame signal corresponding to a given data frame signal, and a preset vertical structuring element; and the filtered Mel spectrum signal is calculated using a formula (not reproduced in this text) whose terms denote the filtered Mel spectrum signal of a given Mel frequency band of the fused frame signal corresponding to a given data frame signal, the transpose of the opening operation signal of that Mel frequency band, the maximum value in the initial Mel spectrum of the Mel frequency bands, and the maximum value among the opening operation signals of the Mel frequency bands.
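The morphological operations named in claims 6 and 7 can be illustrated with the standard grayscale definitions: a closing is a dilation followed by an erosion, and an opening is an erosion followed by a dilation. The sketch below assumes flat structuring elements of odd length, a spectrum stored as nested lists (Mel bands by frames), "horizontal" meaning along the time axis and "vertical" meaning along the frequency axis. The final peak rescaling is only a guess at how the two maximum values listed in claim 7 are used. All function names are hypothetical.

```python
def _dilate(values, size):
    """Grayscale dilation with a flat window of odd length `size`."""
    half, n = size // 2, len(values)
    return [max(values[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]


def _erode(values, size):
    """Grayscale erosion with a flat window of odd length `size`."""
    half, n = size // 2, len(values)
    return [min(values[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]


def _transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def horizontal_closing(mel, size):
    """Closing along time: fills short gaps inside each Mel band."""
    return [_erode(_dilate(row, size), size) for row in mel]


def vertical_opening(mel, size):
    """Opening along frequency: removes components narrower than the window."""
    columns = _transpose(mel)
    return _transpose([_dilate(_erode(col, size), size) for col in columns])


def rescale_to_peak(opened, initial):
    """Assumed final step: scale the opened signal to the initial spectrum's peak."""
    peak_opened = max(max(row) for row in opened)
    peak_initial = max(max(row) for row in initial)
    if peak_opened == 0:
        return [row[:] for row in opened]
    scale = peak_initial / peak_opened
    return [[value * scale for value in row] for row in opened]


closed = horizontal_closing([[2, 2, 0, 2, 2]], size=3)
spike_removed = vertical_opening([[0], [0], [5], [0], [0]], size=3)
band_kept = vertical_opening([[0], [4], [4], [4], [0]], size=3)
rescaled = rescale_to_peak([[0, 1], [2, 0]], [[0, 4], [1, 0]])
print(closed, spike_removed, band_kept, rescaled)
```

In this toy example the closing fills a one-frame gap within a band, while the opening removes a one-band spike and preserves a component three bands wide.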

8. The method for monitoring diver breathing in a closed-circuit respirator according to claim 1, characterized in that, in the step of calculating spatial attention weights based on the perceptual quality scores of the data frame signals of the same frame from each audio channel, and fusing the data frame signals of the same frame from each audio channel based on the spatial attention weights to obtain a fused frame signal, the optimal phase lag value of each audio channel is calculated based on the data frame signals of a reference anchor channel, and the fused frame signal is calculated based on the optimal phase lag values and the spatial attention weights.

9. The method for monitoring diver breathing in a closed-circuit respirator according to claim 8, characterized in that, in the step of calculating the fused frame signal based on the optimal phase lag value and the spatial attention weight, the amplitude value of each sampling point in the fused frame signal is calculated based on the optimal phase lag value and the spatial attention weight, and the amplitude values of the sampling points are combined to obtain the fused frame signal, wherein the amplitude value of each sampling point in the fused frame signal is calculated using a formula (not reproduced in this text) whose terms denote: the amplitude value at a given sampling point of the fused frame signal for a given frame; the number of audio channels; the spatial attention weight of the given data frame signal of each audio channel; and the amplitude value of the given data frame signal of each audio channel at the sampling point.
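The fusion described in claims 8 and 9 might look like the sketch below. The claims do not say how the optimal phase lag is found or how it enters the sum, so two assumptions are made: the lag of each channel is the integer shift that maximizes its cross-correlation with the reference anchor channel, and the fused sample is the attention-weighted sum of the lag-aligned channel samples. The names `best_lag`, `fuse_frame`, `anchor` and `max_lag` are hypothetical.

```python
def best_lag(reference, channel, max_lag):
    """Integer shift of `channel` that maximizes correlation with `reference`."""
    n = len(reference)

    def correlation(lag):
        return sum(reference[i] * channel[i + lag]
                   for i in range(n) if 0 <= i + lag < n)

    return max(range(-max_lag, max_lag + 1), key=correlation)


def fuse_frame(channels, weights, anchor=0, max_lag=3):
    """Weighted sum of channels after aligning each one to the anchor channel."""
    n = len(channels[anchor])
    fused = [0.0] * n
    for channel, weight in zip(channels, weights):
        lag = best_lag(channels[anchor], channel, max_lag)
        for i in range(n):
            j = i + lag
            if 0 <= j < n:  # samples shifted outside the frame contribute zero
                fused[i] += weight * channel[j]
    return fused


# Hypothetical two-channel frame: channel 1 is channel 0 delayed by two samples.
channel_0 = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
channel_1 = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
lag_1 = best_lag(channel_0, channel_1, max_lag=3)
fused = fuse_frame([channel_0, channel_1], weights=[0.5, 0.5])
print(lag_1, fused)
```

With alignment, the two pulses add coherently at the anchor channel's position instead of appearing as two separate half-amplitude pulses.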

10. A diver breathing monitoring system for closed-circuit respirators, characterized in that the system includes a computer device, which includes a processor and a memory. The memory stores computer instructions, and the processor executes the computer instructions stored in the memory. When the computer instructions are executed by the processor, the system implements the steps of the method as described in any one of claims 1 to 9.