Sleep assistance mixing control method and device with dual audio source intelligent superposition

By identifying stationary and non-stationary segments in audio, calculating energy distribution and time-shift compensation, constructing a gain coefficient vector, and performing intelligent mixing control with spectral balance and rhythmic complementarity, the problem of spectral conflict and auditory stimulation in traditional mixing is solved, thus improving the sleep-aiding effect.

CN122245339A · Pending · Publication Date: 2026-06-19 · SHENZHEN GLOCUSENT TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: SHENZHEN GLOCUSENT TECH CO LTD
Filing Date: 2026-05-12
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Traditional dual-source mixing methods fail to accurately identify stable and non-stable segments within the audio, lack adaptive gain adjustment and time delay compensation for differences in frequency band energy distribution, leading to spectral conflicts and auditory stimulation, which affects the sleep aid effect.

Method used

By acquiring the cyclic stationary identifier of the audio, calculating the energy distribution value and time forward compensation, constructing the gain coefficient vector, and performing intelligent mixing control with spectral balance and rhythmic complementarity, combined with time-varying weighted modulation and linear attenuation gain, adaptive frequency band adjustment is achieved.

Benefits of technology

It improves the sound quality and sleep-aiding effect of dual-source mixing, solves the problems of spectrum conflict and auditory stimulation, and achieves spectrum balance, rhythm complementarity and dynamic volume control.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention relates to the field of audio mixing control technology, and discloses a sleep-aid audio mixing control method and device for intelligent superposition of dual audio sources. The method includes: acquiring a first cyclic stationary identifier of a first original audio and a second cyclic stationary identifier of a second original audio; calculating a first energy distribution value of the first original audio, calculating a second energy distribution value of the second original audio, and determining a gain coefficient vector; determining a first stationary audio in the first original audio and performing time-forward compensation to obtain a first compensated audio, and determining a second stationary audio in the second original audio and performing time-forward compensation to obtain a second compensated audio; applying the gain coefficient vector to mix the first compensated audio and the second compensated audio, and outputting a target audio signal. This method achieves sleep-aid audio output with spectral balance, rhythmic complementarity, and dynamic volume control, improving the sound quality and sleep-aiding effect of dual-source mixing.

Description

Technical Field

[0001] This invention relates to the field of audio mixing control technology, and in particular to a sleep-aid audio mixing control method and device for intelligent superposition of dual audio sources.

Background Technology

[0002] Audio mixing technology for sleep aids has significant application value in improving sleep quality. Traditional dual-source mixing methods typically use fixed weighting coefficients for simple linear superposition, directly mixing natural environmental sounds such as birdsong and rain with musical audio such as piano and strings. This ignores the fact that a single audio signal can contain both periodic and non-periodic cyclic stationary segments. Existing technologies have three prominent problems in the mixing process. First, they do not accurately identify and differentiate the stationary and non-stationary segments within each audio stream, so rhythmic features cannot be optimized. Second, they lack an adaptive gain adjustment mechanism based on differences in frequency band energy distribution, which easily leads to spectral conflicts and a muddy sound when both audio streams have strong energy in the same frequency band. Third, they do not perform time delay compensation, so when the rhythmic peaks of both audio streams occur simultaneously, they create auditory stimulation and impair the sleep-aid effect.

Summary of the Invention

[0003] This invention provides a method and device for controlling sleep-aid mixing with intelligent superposition of dual audio sources. This invention achieves sleep-aid mixing output with spectrum balance, rhythm complementarity, and dynamic volume control, thereby improving the sound quality and sleep-aiding effect of dual audio source mixing.

[0004] In a first aspect, the present invention provides a sleep-aid mixing control method for intelligent superposition of dual audio sources, the method comprising: obtaining a first cyclic stationary identifier of a first original audio and a second cyclic stationary identifier of a second original audio; calculating a first energy distribution value of each first sub-band in the first original audio, calculating a second energy distribution value of each second sub-band in the second original audio, and determining a gain coefficient vector based on the first energy distribution value and the second energy distribution value; determining a first stationary audio in the first original audio according to the first cyclic stationary identifier and performing time-forward compensation to obtain a first compensated audio, and determining a second stationary audio in the second original audio according to the second cyclic stationary identifier and performing time-forward compensation to obtain a second compensated audio; and mixing the first compensated audio and the second compensated audio using the gain coefficient vector to output a target audio signal.

[0005] In conjunction with the first aspect, in a first implementation of the first aspect of the present invention, obtaining the first cyclic stationary identifier of the first original audio and the second cyclic stationary identifier of the second original audio includes: calculating a first short-time energy envelope function of the first original audio and a second short-time energy envelope function of the second original audio; performing autocorrelation analysis on the first short-time energy envelope function to obtain a first envelope periodicity index, and performing autocorrelation analysis on the second short-time energy envelope function to obtain a second envelope periodicity index; based on the first envelope periodicity index, determining whether the first original audio is a periodic or non-periodic cyclic stationary signal and recording the result as the first cyclic stationary identifier; and based on the second envelope periodicity index, determining whether the second original audio is a periodic or non-periodic cyclic stationary signal and recording the result as the second cyclic stationary identifier.

[0006] In conjunction with the first aspect, in a second implementation of the first aspect of the present invention, the step of calculating the first energy distribution value of each first sub-band in the first original audio, calculating the second energy distribution value of each second sub-band in the second original audio, and determining the gain coefficient vector based on the first energy distribution value and the second energy distribution value includes: dividing the first original audio into multiple first sub-bands and accumulating the squares of the spectral amplitudes of the frequency points in each first sub-band to obtain the first energy distribution value; dividing the second original audio into multiple second sub-bands and accumulating the squares of the spectral amplitudes of the frequency points in each second sub-band to obtain the second energy distribution value; for each sub-band, calculating a band energy difference index from the corresponding first energy distribution value and second energy distribution value; determining energy complementary frequency bands based on the band energy difference index and a first preset threshold, and determining energy conflict frequency bands based on the band energy difference index and a second preset threshold; and constructing a gain amplification coefficient for each energy complementary frequency band, constructing a gain attenuation coefficient for each energy conflict frequency band, and generating the gain coefficient vector based on the gain amplification coefficients and the gain attenuation coefficients.

[0007] In conjunction with the first aspect, in a third implementation of the first aspect of the present invention, the step of constructing a gain amplification coefficient for the energy complementary frequency band, constructing a gain attenuation coefficient for the energy conflict frequency band, and generating a gain coefficient vector based on the gain amplification coefficient and the gain attenuation coefficient includes: For the energy complementary frequency band, the first normalization coefficient is obtained by subtracting the first preset threshold from the frequency band energy difference index and dividing it by the first preset difference range. The first normalization coefficient is then multiplied by the first preset gain amplitude and added to the first reference gain value to obtain the gain amplification coefficient. For the energy conflict frequency band, the second normalization coefficient is obtained by subtracting the frequency band energy difference index from the second preset threshold and then dividing by the second preset threshold. The gain attenuation coefficient is obtained by subtracting the product of the second normalization coefficient and the second preset attenuation amplitude from the second reference gain value. A gain coefficient vector is constructed based on the gain amplification factor and the gain attenuation factor.
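The two formulas in [0007] can be illustrated with a short Python sketch that computes the gain for a single sub-band. The thresholds 0.55 and 0.25 and the default gain of 1.0 for intermediate bands come from the detailed description. The difference range, gain amplitude, attenuation amplitude, and reference gains are hypothetical placeholder values, because the text does not state them.

```python
def band_gain(diff_index, t1=0.55, range1=0.45, amp1=0.5, ref1=1.0,
              t2=0.25, amp2=0.5, ref2=1.0):
    """Gain for one sub-band from its band energy difference index.

    range1, amp1, amp2, ref1 and ref2 are assumed values for illustration only.
    """
    if diff_index > t1:
        # Energy complementary band: amplify in proportion to the excess over t1.
        norm1 = (diff_index - t1) / range1
        return ref1 + norm1 * amp1
    if diff_index < t2:
        # Energy conflict band: attenuate more as the index approaches zero.
        norm2 = (t2 - diff_index) / t2
        return ref2 - norm2 * amp2
    # Intermediate band: gain left unchanged.
    return 1.0
```

Under these assumed values, an index of 1.0 gives a gain of about 1.5 and an index of 0.0 gives 0.5, so the gain moves linearly away from 1.0 as a band becomes more clearly complementary or conflicting.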

[0008] In conjunction with the first aspect, in a fourth implementation of the first aspect of the present invention, the step of determining the first stationary audio in the first original audio according to the first cyclic stationary identifier and performing time-forward compensation to obtain the first compensated audio, and determining the second stationary audio in the second original audio according to the second cyclic stationary identifier and performing time-forward compensation to obtain the second compensated audio, includes: calculating a cross-correlation function value sequence of the first original audio and the second original audio, and taking the time delay at the position of the maximum value of the sequence as the initial phase difference; determining the first stationary audio in the first original audio based on the first cyclic stationary identifier and obtaining the corresponding first beat period, then subtracting the initial phase difference from half of the first beat period, multiplying by the sampling frequency, and rounding to obtain a first number of time-forward sampling points; determining the second stationary audio in the second original audio based on the second cyclic stationary identifier and obtaining the corresponding second beat period, then subtracting the initial phase difference from half of the second beat period, multiplying by the sampling frequency, and rounding to obtain a second number of time-forward sampling points; and shifting the first stationary audio forward by the first number of time-forward sampling points to obtain the first compensated audio, and shifting the second stationary audio forward by the second number of time-forward sampling points to obtain the second compensated audio.

[0009] In conjunction with the first aspect, in a fifth implementation of the first aspect of the present invention, the step of calculating the numerical sequence of the cross-correlation function of the first original audio and the second original audio, and searching for the time delay parameter corresponding to the position of the maximum value of the cross-correlation function numerical sequence as the initial phase difference, includes: The minimum and maximum delay times within the preset delay range are multiplied by the sampling frequency to obtain the lower limit and upper limit of the delay sampling points, respectively. The sampling point search interval is then determined based on the lower and upper limits of the delay sampling points. For each delayed sampling point value within the sampling point search interval, the sampling points of the first original audio are multiplied one by one with the sampling points of the corresponding delayed position of the second original audio, and then summed to obtain the cross-correlation function value corresponding to the delayed sampling point value. The cross-correlation function value sequence is obtained by traversing the sampling point search interval. Find the cross-correlation function value with the largest value in the cross-correlation function value sequence, and divide the delayed sampling point value corresponding to the largest cross-correlation function value by the sampling frequency to obtain the initial phase difference.
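The delay search in [0009] can be sketched in Python as follows. The function name and arguments are hypothetical. The sketch also assumes non-negative delays, rounds the delay limits to whole samples, and sums only over the overlapping part of the two signals, none of which the text specifies.

```python
def initial_phase_difference(first, second, fs, min_delay_s, max_delay_s):
    """Delay in seconds, within the preset range, that maximises the cross-correlation.

    Assumes 0 <= min_delay_s <= max_delay_s; limit rounding is an assumption.
    """
    lo = int(round(min_delay_s * fs))  # lower limit of delay sampling points
    hi = int(round(max_delay_s * fs))  # upper limit of delay sampling points
    best_lag, best_val = lo, float("-inf")
    for lag in range(lo, hi + 1):
        # Multiply sample by sample with the delayed second signal, then sum.
        overlap = max(0, min(len(first), len(second) - lag))
        val = sum(first[i] * second[i + lag] for i in range(overlap))
        if val > best_val:
            best_lag, best_val = lag, val
    # Convert the best delay from samples back to seconds.
    return best_lag / fs
```

For the 50 to 800 millisecond window mentioned in the detailed description at 44100 Hz, the search would cover delays of 2205 to 35280 samples. A direct sum like this is slow for long recordings, so a practical implementation might compute the correlation on a downsampled signal or with an FFT.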

[0010] In conjunction with the first aspect, in a sixth implementation of the first aspect of the present invention, the step of applying the gain coefficient vector to mix the first compensated audio and the second compensated audio to output a target audio signal includes: The first non-stationary audio in the first original audio is determined according to the first cyclic stationary identifier, and the second non-stationary audio in the second original audio is determined according to the second cyclic stationary identifier. Assign a first basic weight coefficient to the first non-stationary audio and the second non-stationary audio, assign a second basic weight coefficient to the first stationary audio and the second stationary audio, and introduce a sinusoidal time-varying modulation function to construct the first time-varying weight coefficient and the second time-varying weight coefficient. Perform a Fourier transform on the first compensated audio to obtain first frequency domain data, and perform a Fourier transform on the second compensated audio to obtain second frequency domain data; The first weighted frequency domain data is obtained by multiplying the first time-varying weight coefficient, the first frequency domain data, and the gain coefficient vector point by point; the second weighted frequency domain data is obtained by multiplying the second time-varying weight coefficient, the second frequency domain data, and the gain coefficient vector point by point. The first weighted frequency domain data and the second weighted frequency domain data are summed point by point to obtain mixed frequency domain data, and the mixed frequency domain data are subjected to inverse Fourier transform to obtain a time-domain mixed signal. A linear attenuation gain is applied to the time-domain mixed signal during the sleep induction stage, the light sleep maintenance stage, and the deep sleep stage, respectively, and the target audio signal is output.
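The sinusoidal time-varying modulation in [0010] can be sketched in Python using the values given in the detailed description: base weights of 0.58 and 0.42, ranges of 0.50 to 0.66 and 0.34 to 0.50, and a 120-second period. Running the two weights in opposite phase, so that they always sum to 1, is an assumption based on the statement that the weights increase alternately.

```python
import math


def time_varying_weights(t, base1=0.58, base2=0.42, depth=0.08, period_s=120.0):
    """Return (first weight, second weight) at playback time t in seconds.

    depth=0.08 follows from the stated ranges; the anti-phase relation is assumed.
    """
    modulation = depth * math.sin(2.0 * math.pi * t / period_s)
    return base1 + modulation, base2 - modulation
```

At t = 30 s, one quarter of the period, the first weight reaches its maximum of 0.66 while the second reaches its minimum of 0.34.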

[0011] In conjunction with the first aspect, in the seventh implementation of the first aspect of the present invention, the step of multiplying the first time-varying weighting coefficient, the first frequency domain data, and the gain coefficient vector frequency-by-frequency to obtain the first weighted frequency domain data, and multiplying the second time-varying weighting coefficient, the second frequency domain data, and the gain coefficient vector frequency-by-frequency to obtain the second weighted frequency domain data, includes: For each first frequency point in the first frequency domain data, the first time-varying weighting coefficient is multiplied by the spectral amplitude of the first frequency point and then multiplied by the gain coefficient corresponding to the sub-frequency band to which the first frequency point belongs in the gain coefficient vector to obtain the first weighted frequency domain amplitude of the first frequency point. The first weighted frequency domain data is obtained by traversing all first frequency points. For each second frequency point in the second frequency domain data, the second time-varying weighting coefficient is multiplied by the spectral amplitude of the second frequency point and then multiplied by the gain coefficient corresponding to the sub-frequency band to which the second frequency point belongs in the gain coefficient vector to obtain the second weighted frequency domain amplitude of the second frequency point. The second weighted frequency domain data is obtained by traversing all second frequency points. The weighted frequency domain amplitudes of the first weighted frequency domain data and the second weighted frequency domain data are added one by one to obtain the mixed frequency domain data.
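The frequency-point weighting and summation in [0011] can be sketched in Python as follows. The function name is hypothetical. The sketch assumes both spectra have the same length, that the sub-bands are of equal width, and that the bins may be complex values so that phase is carried through the multiplication; the text itself speaks only of spectral amplitudes.

```python
def mix_frequency_bins(bins1, bins2, w1, w2, gains):
    """Weight two spectra per frequency point and sum them.

    bins1, bins2: equal-length sequences of real or complex frequency points.
    gains: one gain coefficient per equal-width sub-band (assumed layout).
    """
    bins_per_band = max(1, len(bins1) // len(gains))
    mixed = []
    for k, (a, b) in enumerate(zip(bins1, bins2)):
        # Look up the gain of the sub-band that frequency point k belongs to.
        band = min(k // bins_per_band, len(gains) - 1)
        g = gains[band]
        mixed.append(w1 * a * g + w2 * b * g)
    return mixed
```

With 1024 frequency points and a gain vector of length 64, each gain coefficient applies to 16 consecutive frequency points.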

[0012] In conjunction with the first aspect, in the eighth implementation of the first aspect of the present invention, the step of applying a linear attenuation gain to the time-domain mixed signal during the sleep induction stage, the light sleep maintenance stage, and the deep sleep stage, respectively, and outputting a target audio signal, includes: The sleep stage is determined based on the playback time. When the playback time is between the first and second time, it is determined to be the sleep induction stage. When the playback time is between the second and third time, it is determined to be the light sleep maintenance stage. When the playback time exceeds the third time, it is determined to be the deep sleep stage. During the sleep induction phase, the first-stage gain coefficient is obtained by multiplying the initial gain value by 1, subtracting the ratio of the playback time to the duration of the first stage, and then multiplying by the first attenuation ratio. During the light sleep maintenance phase, the second-stage gain coefficient is obtained by multiplying the gain value at the end of the sleep induction phase by 1, subtracting the ratio of the playback time offset to the duration of the second stage, and then multiplying by the second attenuation ratio. During the deep sleep phase, a constant third-stage gain coefficient is used. The time-domain mixed signal is subjected to Fourier transform and low-frequency band enhancement to obtain low-frequency enhanced frequency domain data. The stage gain coefficient corresponding to the current sleep stage is multiplied with the low-frequency enhanced frequency domain data and then subjected to inverse Fourier transform to output the target audio signal.
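The staged gain in [0012] can be read as a piecewise linear function of playback time. The Python sketch below adopts one interpretation: in each of the first two stages the gain equals the starting gain multiplied by (1 minus the elapsed fraction of the stage times the attenuation ratio). The stage boundaries, attenuation ratios, and the choice to hold the deep sleep gain at the value reached at the end of the second stage are all hypothetical, and the low-frequency enhancement step is omitted.

```python
def stage_gain(t, g0=1.0, t1=0.0, t2=900.0, t3=2700.0, r1=0.3, r2=0.5):
    """Gain at playback time t (seconds) under assumed stage boundaries.

    t1..t2: sleep induction, t2..t3: light sleep maintenance, after t3: deep sleep.
    All default values are illustrative assumptions, not values from the source.
    """
    end_of_induction = g0 * (1.0 - r1)
    if t < t2:
        # Sleep induction: linear decay from g0 down to g0 * (1 - r1).
        elapsed = max(0.0, t - t1) / (t2 - t1)
        return g0 * (1.0 - elapsed * r1)
    if t < t3:
        # Light sleep maintenance: linear decay from the end-of-induction gain.
        elapsed = (t - t2) / (t3 - t2)
        return end_of_induction * (1.0 - elapsed * r2)
    # Deep sleep: constant gain (assumed equal to the end of the second stage).
    return end_of_induction * (1.0 - r2)
```

With these assumed values the gain falls from 1.0 to 0.7 over the first 15 minutes, from 0.7 to 0.35 over the next 30 minutes, and then stays at 0.35.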

[0013] In a second aspect, the present invention provides a sleep-aid mixing control device with intelligent superposition of dual audio sources, the sleep-aid mixing control device comprising: an acquisition module, used to acquire the first cyclic stationary identifier of the first original audio and the second cyclic stationary identifier of the second original audio; a calculation module, used to calculate the first energy distribution value of each first sub-band in the first original audio, calculate the second energy distribution value of each second sub-band in the second original audio, and determine the gain coefficient vector based on the first energy distribution value and the second energy distribution value; a compensation module, used to determine the first stationary audio in the first original audio according to the first cyclic stationary identifier and perform time-forward compensation to obtain the first compensated audio, and to determine the second stationary audio in the second original audio according to the second cyclic stationary identifier and perform time-forward compensation to obtain the second compensated audio; and an output module, used to mix the first compensated audio and the second compensated audio using the gain coefficient vector and output the target audio signal.

[0014] The technical solution provided by this invention achieves intelligent overlay mixing control by accurately identifying and differentiating the stable and non-stable segments that exist simultaneously within the first and second original audio files, combined with the optimal frequency band adaptive gain adjustment and time delay compensation mechanism. This invention identifies periodic and non-periodic cyclic stationary signal segments within each audio stream using short-time energy envelope function and autocorrelation analysis. By dividing the audio into multiple sub-bands and calculating the frequency band energy difference index, it constructs gain amplification coefficients for energy-complementary frequency bands and gain attenuation coefficients for energy-conflicting frequency bands, forming a frequency band adaptive gain coefficient vector. This solves the muddiness problem caused by spectral conflict in traditional mixing. The invention calculates the initial phase difference using a cross-correlation function and performs time-shift compensation on stationary audio segments based on the beat cycle, ensuring that the rhythm peaks of the two audio streams are staggered in the time domain, avoiding auditory stimulation caused by rhythm superposition. By assigning differentiated basic weight coefficients to non-stationary and stationary audio segments, and introducing a sinusoidal time-varying modulation function to construct time-varying weight coefficients, it performs weighted mixing in the frequency domain and applies the frequency band adaptive gain coefficient vector. Combined with three-segment linear attenuation gain and low-frequency enhancement processing, it achieves a sleep-aiding mixed output with spectral balance, rhythmic complementarity, and dynamic volume control, improving the sound quality and sleep-aiding effect of dual-source mixing.

[0015] Other features and advantages of the invention will be set forth in the description which follows, and will be apparent in part from the description, or may be learned by practicing the invention. The objects and other advantages of the invention are realized and obtained in accordance with the structures particularly pointed out in the description, claims and drawings.

[0016] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, preferred embodiments are described below in detail with reference to the accompanying drawings.

Description of the Drawings

[0017] Figure 1 is a schematic diagram of an embodiment of the sleep-aid mixing control method for intelligent superposition of dual audio sources in this invention. Figure 2 is a schematic diagram of an embodiment of the sleep-aid mixing control device with intelligent superposition of dual audio sources in this invention.

Detailed Implementation

[0018] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0019] The terms "comprising" and "having," and any variations thereof, used in the embodiments of this invention are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not limited to the steps or units listed, but may optionally include other steps or units not listed, or may optionally include other steps or units inherent to these processes, methods, products, or devices.

[0020] To facilitate understanding of this embodiment, a sleep-aid mixing control method for intelligent superposition of dual audio sources, as disclosed in this embodiment of the invention, will first be described in detail. As shown in Figure 1, this method includes the following steps: 101. Obtain the first cyclic stationary identifier of the first original audio and the second cyclic stationary identifier of the second original audio. Specifically, a short-time energy calculation with a window length of 2048 sampling points is performed on the first original audio signal. The sampling frequency is set to 44100Hz, and every 2048 consecutive sampling points are used as one analysis window. By summing the squared amplitudes of the sampling points within the window and sliding the window in time order, a first short-time energy envelope function is constructed, which captures the energy change trend of the audio signal on a time scale of approximately 46 milliseconds. The same processing is performed on the second original audio signal to obtain a second short-time energy envelope function. An autocorrelation function is calculated on the first short-time energy envelope function: using different time delays τ as the variable, the original function is multiplied by its own delayed version and accumulated to form a first autocorrelation function curve. By analyzing the local extreme points of the autocorrelation function curve, the delay time corresponding to the position of the main peak is identified, and this delay time is used as the first envelope periodicity index to reflect the rhythmic repetition characteristics of the first audio signal in the time domain. Autocorrelation analysis is performed on the second short-time energy envelope function in the same way to obtain a second envelope periodicity index. 
Cyclostationarity is classified and judged based on the specific numerical range of each envelope periodicity index: when the envelope periodicity index is between 0.5 seconds and 3 seconds, it indicates that the audio has obvious periodic rhythmic characteristics, meeting the judgment criteria for periodic cyclostationary signals, and the cyclostationarity index of this audio is set to 1; when the envelope periodicity index exceeds 5 seconds or the autocorrelation function does not show an obvious main peak, it indicates that the audio lacks obvious rhythmic repetition and belongs to the category of non-periodic cyclostationary signals, and its cyclostationarity index is set to 0. The judgment result corresponding to the first audio is registered as the first cyclostationarity index, and the judgment result corresponding to the second audio is registered as the second cyclostationarity index.
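Step 101 can be sketched in Python as follows. The function names are hypothetical. The sketch assumes non-overlapping windows, removes the envelope mean and normalises the autocorrelation before peak picking, and uses a minimum peak height of 0.3 to decide whether a main peak is "obvious"; none of these details are given in the text. Periods between 3 and 5 seconds or below 0.5 seconds are not covered by the text and are treated here as non-periodic.

```python
def short_time_energy(samples, window=2048, hop=2048):
    """Sum of squared amplitudes per window; the hop size is an assumption."""
    return [sum(x * x for x in samples[i:i + window])
            for i in range(0, len(samples) - window + 1, hop)]


def envelope_period_seconds(envelope, frames_per_second, min_peak=0.3):
    """Lag in seconds of the main autocorrelation peak, or None if absent."""
    n = len(envelope)
    if n < 4:
        return None
    mean = sum(envelope) / n
    centered = [e - mean for e in envelope]
    r0 = sum(c * c for c in centered)
    if r0 == 0:
        return None  # flat envelope: no rhythm to detect
    # Normalised autocorrelation for lags up to half the envelope length.
    acf = [sum(centered[i] * centered[i + lag] for i in range(n - lag)) / r0
           for lag in range(n // 2)]
    best_lag, best_val = None, min_peak
    for lag in range(1, len(acf) - 1):
        is_local_max = acf[lag] > acf[lag - 1] and acf[lag] >= acf[lag + 1]
        if is_local_max and acf[lag] > best_val:
            best_lag, best_val = lag, acf[lag]
    return None if best_lag is None else best_lag / frames_per_second


def cyclic_stationary_identifier(period_s):
    """1 for a period of 0.5 s to 3 s, otherwise 0 (unspecified ranges assumed 0)."""
    if period_s is not None and 0.5 <= period_s <= 3.0:
        return 1
    return 0
```

With the stated window of 2048 samples at 44100 Hz and a hop equal to the window, `frames_per_second` would be 44100 / 2048, or about 21.5.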

[0021] 102. Calculate the first energy distribution value of each first sub-band in the first original audio, calculate the second energy distribution value of each second sub-band in the second original audio, and determine the gain coefficient vector based on the first energy distribution value and the second energy distribution value. Specifically, the two original audio signals are converted to frequency domain representations. A 2048-point Fast Fourier Transform is performed on each audio signal to map it from a time-domain signal to a frequency distribution. The frequency domain amplitude data is then divided into multiple equal-width sub-bands, uniformly dividing the entire frequency range from 0Hz to 22050Hz into 64 sub-bands. Each sub-band has a bandwidth of approximately 344.53Hz. Within each sub-band, the sum of the squares of the spectral amplitudes at all frequency points within that sub-band is accumulated to form the energy distribution value set for each sub-band of the first audio signal and the corresponding energy distribution value set for the second audio signal, denoted as the first energy distribution value and the second energy distribution value, respectively. For each corresponding sub-band position, an energy difference index is calculated between the first and second energy distribution values. This energy difference index is obtained by calculating the absolute value of the difference between the two values and dividing by the normalized sum, quantifying the degree of energy complementarity or conflict between the sub-bands in the two audio signals. The energy difference index of each frequency band is compared with two preset numerical thresholds. If the energy difference index is higher than the first preset threshold of 0.55, the sub-frequency band is determined to be an energy complementary frequency band. 
This type of frequency band has the characteristic of one channel having high energy and the other having low energy, which is suitable for enhancing its overall performance. Based on this, a gain amplification factor is constructed for the frequency band to make it more prominent after mixing. If the energy difference index is lower than the second preset threshold of 0.25, the frequency band is determined to be an energy conflicting frequency band, indicating that the frequency band has high energy in both audio channels. If no treatment is done, it will cause spectral energy stacking and distortion after mixing. A gain attenuation factor is applied to this type of frequency band to control its output energy and alleviate the conflict. For intermediate frequency bands that do not fall into the above two categories, their gain factor is kept at 1.0 by default. The gain amplification factor, gain attenuation factor and hold value corresponding to all 64 sub-frequency bands are organized in order of frequency band to construct a gain factor vector of length 64.
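The sub-band energy calculation and classification in step 102 can be sketched in Python as follows. The function names are hypothetical, and the input is assumed to be a list of spectral amplitudes already produced by an FFT (for example 1024 values, giving 16 frequency points per sub-band). The difference index is interpreted here as the absolute difference divided by the sum of the two energies, which is one reading of "dividing by the normalized sum"; the small `eps` term is added only to avoid division by zero.

```python
def band_energies(magnitudes, n_bands=64):
    """Sum of squared spectral amplitudes in each equal-width sub-band."""
    per_band = len(magnitudes) // n_bands
    return [sum(m * m for m in magnitudes[b * per_band:(b + 1) * per_band])
            for b in range(n_bands)]


def difference_index(e1, e2, eps=1e-12):
    """Assumed form of the index: |e1 - e2| / (e1 + e2), in the range 0 to 1."""
    return abs(e1 - e2) / (e1 + e2 + eps)


def classify_band(index, complementary_threshold=0.55, conflict_threshold=0.25):
    """Label a sub-band using the two preset thresholds from the text."""
    if index > complementary_threshold:
        return "complementary"
    if index < conflict_threshold:
        return "conflict"
    return "neutral"
```

Under this reading, a band that is nearly silent in both sources also yields an index near zero and would be labelled as conflicting, so a practical implementation might add a minimum-energy check before attenuating.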

[0022] 103. Determine the first stable audio in the first original audio based on the first cyclic stable identifier and perform time forward compensation to obtain the first compensated audio; determine the second stable audio in the second original audio based on the second cyclic stable identifier and perform time forward compensation to obtain the second compensated audio. Specifically, the temporal relative positional relationship between the first and second original audio tracks is modeled. By calculating the numerical sequence of the cross-correlation function of the two audio tracks, the degree of matching between them under different time delays is obtained. The cross-correlation function is defined as the sum of the product of the first and second audio tracks under different delays τ. Within a set sleep-aid applicable delay window (e.g., within 50 milliseconds to 800 milliseconds), the cross-correlation function curve is scanned point by point, and the time delay parameter corresponding to the position where the maximum value is obtained is recorded. This time delay is the initial phase difference between the two audio tracks, which characterizes the degree of offset of the rhythm peaks of the two tracks on the time axis. Read the first cycle stationary indicator and determine whether the first audio signal has a periodic rhythmic structure based on the indicator value. If it is a periodic signal, extract the beat period value obtained from its autocorrelation analysis and record it as the first beat period. Perform time compensation calculation accordingly. Subtract the initial phase difference from half of the first beat period, multiply by the sampling frequency (e.g., 44100Hz), and round down to obtain the first time forward sampling point number, which represents the sampling displacement amount that needs to be shifted forward. 
Similarly, read the second cycle stationary indicator and determine whether the second audio signal constitutes a periodic cycle stationary signal. If it does, obtain its second beat period. Subtract the initial phase difference from half of the beat period, multiply by the sampling frequency, and round down to obtain the second time forward sampling point number. After obtaining the compensation shifts of the two audio signals, the sampling data of the first periodic audio is shifted forward by the number of time advance sampling points, thus advancing the subsequent segments of the original signal to the front, forming the first compensated audio after rhythm structure adjustment; at the same time, the same forward shift operation is performed on the second periodic audio signal, shifting the signal forward by the number of time advance sampling points calculated, thus forming the second compensated audio.
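
The forward-shift calculation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name and the example beat period and phase difference are hypothetical, and the sketch does not address the case where the phase difference exceeds half the beat period, which the text leaves open.

```python
import math


def forward_shift_samples(beat_period_s, initial_phase_diff_s, sample_rate_hz=44100):
    """Return floor((beat_period / 2 - initial_phase_diff) * sample_rate).

    Follows the description: half of the beat period minus the initial
    phase difference, times the sampling frequency, rounded down.
    """
    shift_seconds = beat_period_s / 2.0 - initial_phase_diff_s
    return math.floor(shift_seconds * sample_rate_hz)


# Hypothetical example: 1.0 s beat period, 0.25 s initial phase difference.
# (0.5 - 0.25) * 44100 = 11025 samples.
print(forward_shift_samples(1.0, 0.25))
```

Because the result is rounded down, values that are not exactly representable in floating point may land one sample below the nominal figure; at 44100Hz that is an error of about 0.02 milliseconds.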

[0023] 104. Apply the gain coefficient vector to mix the first compensated audio and the second compensated audio, and output the target audio signal.

[0024] Specifically, based on the first and second cyclic stationary indicators, the audio portions in the first original audio that do not belong to a periodic cyclic stationary structure are identified as the first non-stationary audio, and the portions in the second original audio that do not have a clear rhythmic structure are identified as the second non-stationary audio. A base weight coefficient of 0.58 is assigned to these two non-stationary audio segments, while a base weight coefficient of 0.42 is assigned to the first and second compensated audio segments, reflecting the principle that natural non-rhythmic sounds should be the dominant signal and regular rhythmic sounds should be the auxiliary signal in a sleep-aid scenario. To avoid a static or repetitive feeling during mixing, a time-varying modulation mechanism based on a sine function is introduced. The first time-varying weight coefficient oscillates slowly between 0.50 and 0.66, and the second time-varying weight coefficient oscillates slowly between 0.34 and 0.50. The modulation period is set to 120 seconds, allowing the weight coefficients to alternately increase in a slow-periodic fluctuation on the time axis, providing continuous dynamic auditory changes. A 2048-point Fast Fourier Transform is performed on the first and second compensated audio signals respectively to obtain their corresponding first and second frequency domain data. Then, each frequency point within the 64 frequency domain segments is multiplied frequency-by-frequency by the corresponding time-varying weighting coefficient and the gain parameter at the corresponding position in the gain coefficient vector, generating first and second weighted frequency domain data respectively. The first and second weighted frequency domain data are then summed frequency-by-frequency to form mixed frequency domain data. 
Finally, a 2048-point Inverse Fast Fourier Transform is performed on the mixed frequency domain data to recover the time-domain mixed signal. Considering that the sleep aid process needs to match the physiological stages of human sleep, the entire playback duration is divided into a sleep induction stage (0 to 20 minutes), a light sleep maintenance stage (20 to 40 minutes), and a deep sleep stabilization stage (after 40 minutes). A linearly decreasing total gain adjustment function is applied to the mixed signal in each of these three stages. The first two stages achieve a smooth volume transition through linear attenuation of 12% and 18%, respectively, while the deep sleep stage maintains a constant low gain value of 72% of the initial volume to maintain sleep stability. After applying the corresponding gain functions for each stage, the target audio signal is output.

[0025] In one specific embodiment, the process of performing step 101 may specifically include the following steps: Calculate the first short-time energy envelope function of the first original audio and the second short-time energy envelope function of the second original audio; Perform autocorrelation analysis on the first short-time energy envelope function to obtain the first envelope periodicity index, and perform autocorrelation analysis on the second short-time energy envelope function to obtain the second envelope periodicity index. Based on the first envelope periodicity index, determine whether the first original audio is a periodic or non-periodic cyclic stationary signal and record the first cyclic stationary indicator. Based on the second envelope periodicity index, determine whether the second original audio is a periodic or non-periodic cyclic stationary signal and record the second cyclic stationary indicator.

[0026] Specifically, the original audio signal is segmented into frames using a window function. At a sampling frequency of 44100Hz, a fixed analysis window of 2048 sampling points is used to perform frame-by-frame short-time energy calculation. During the calculation, for any given analysis frame, the squared amplitudes of the 2048 consecutive sampling points are summed, and the summation results are arranged chronologically to construct a short-time energy time series, i.e., the short-time energy envelope function. This reflects the energy variation trend of the audio at a time resolution of 46.44 milliseconds, thus providing a stable and temporally continuous quantitative basis for the rhythmic fluctuations of the audio. After extracting the short-time energy envelope function of the first original audio signal, autocorrelation analysis is performed on the energy envelope function. This involves multiplying it point-by-point with versions delayed by different times and summing the results to obtain an autocorrelation function sequence with respect to the delay time τ. If a clear periodic peak exists in the autocorrelation function sequence... If the time delay corresponding to the location of the main peak is between 0.5 seconds and 3 seconds, the audio is considered to have a stable rhythmic structure and belongs to a periodic cyclic stationary signal. Accordingly, the first cyclic stationary flag is set to 1. Conversely, if the autocorrelation function does not show a clear periodic main peak in the entire analysis interval, or the delay corresponding to the main peak is greater than 5 seconds, it indicates that the audio lacks significant rhythmic periodicity. Therefore, it is judged as a non-periodic cyclic stationary signal, and the first cyclic stationary flag is set to 0. Similarly, the above operation process is repeated for the second original audio to construct the corresponding second short-time energy envelope function. 
Then, autocorrelation analysis is performed to obtain the second autocorrelation function curve, and the second envelope periodicity index is calculated based on the location of its periodic main peak. By judging whether the second envelope periodicity index falls into the set periodic interval, the determination of whether the second audio belongs to a periodic cyclic stationary signal is completed, and the second cyclic stationary flag is generated.
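
As an illustration of the envelope and periodicity test described above, the following sketch computes a frame-wise energy envelope and sets a flag from its autocorrelation. Several details are assumptions made only for this example: frames do not overlap, the envelope mean is removed before correlating, the search is restricted to the 0.5 to 3 second interval, and a hypothetical `peak_ratio` threshold stands in for the unspecified notion of a "clear" peak.

```python
import math


def short_time_energy(samples, frame_len=2048):
    """Sum of squared amplitudes for each non-overlapping frame."""
    n_frames = len(samples) // frame_len
    return [
        sum(x * x for x in samples[i * frame_len:(i + 1) * frame_len])
        for i in range(n_frames)
    ]


def cyclostationary_flag(envelope, frame_period_s,
                         min_period_s=0.5, max_period_s=3.0, peak_ratio=0.3):
    """Return (flag, beat_period_s); flag is 1 for a periodic envelope.

    peak_ratio is a hypothetical threshold on R(lag) / R(0).
    """
    mean = sum(envelope) / len(envelope)
    centered = [e - mean for e in envelope]  # assumption: remove the DC offset
    r0 = sum(c * c for c in centered)
    if r0 == 0.0:
        return 0, None  # flat envelope, no rhythm to detect
    min_lag = max(1, math.ceil(min_period_s / frame_period_s))
    max_lag = min(len(centered) - 1, math.floor(max_period_s / frame_period_s))
    best_lag, best_val = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        r = sum(centered[i] * centered[i + lag]
                for i in range(len(centered) - lag))
        if r > best_val:
            best_lag, best_val = lag, r
    if best_lag is None or best_val / r0 < peak_ratio:
        return 0, None
    return 1, best_lag * frame_period_s


# Synthetic envelope with an energy pulse every 20 frames of 0.05 s each.
envelope = [1.0 if i % 20 == 0 else 0.1 for i in range(200)]
flag, period = cyclostationary_flag(envelope, 0.05)
print(flag, period)
```

With a 2048-sample frame at 44100Hz, `frame_period_s` would be about 0.04644 seconds, matching the time resolution stated above.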

[0027] In one specific embodiment, the process of performing step 102 may specifically include the following steps: The first original audio is divided into multiple first sub-bands, and the squares of the spectral amplitudes of the frequency points in each first sub-band are accumulated to obtain the first energy distribution value. The second original audio is divided into multiple second sub-bands, and the squares of the spectral amplitudes of the frequency points in each second sub-band are accumulated to obtain the second energy distribution value. For each sub-band, the band energy difference index is calculated from the corresponding first energy distribution value and second energy distribution value. The energy complementary frequency bands are determined based on the band energy difference index and the first preset threshold, and the energy conflict frequency bands are determined based on the band energy difference index and the second preset threshold. A gain amplification factor is constructed for each energy complementary frequency band, and a gain attenuation factor is constructed for each energy conflict frequency band. A gain coefficient vector is generated based on the gain amplification factor and the gain attenuation factor.

[0028] Specifically, the first and second original audio signals are converted into frequency domain form. A 2048-point Fast Fourier Transform is used to map each audio signal from the time domain to the frequency domain, obtaining complete spectral data. The effective frequency range from 0Hz to 22050Hz is divided into 64 sub-bands at equal intervals, with each sub-band spanning approximately 344.53Hz. For each sub-band, a spectral energy statistical operation is performed. Within each sub-band, the amplitude of all included frequency points is squared, and the squared values are accumulated to form the total energy within each sub-band. This process is repeated to obtain the first energy distribution value set corresponding to the first original audio signal and the second energy distribution value set corresponding to the second original audio signal. An energy difference index is calculated for the first and second energy distribution values at corresponding frequency band positions. The energy difference index is defined as the absolute value of the difference between the two values divided by their sum, where a small positive number is added to the sum to avoid a zero denominator. The index reflects whether there is significant complementarity or severe overlap in the energy distribution of the two audio signals within that frequency band. The classification status of a frequency band is determined by comparing its energy difference index with two preset thresholds. When the difference index is greater than the first preset threshold of 0.55, it indicates that the energy of one audio track is significantly higher than that of the other, and this frequency band is considered a complementary band, suitable for enhancement in mixing. In this case, a gain amplification factor is constructed, with its value varying linearly between 0.95 and 1.1, to enhance the performance of complementary components. 
When the difference index is lower than the second preset threshold of 0.25, it indicates that both audio tracks have high energy in this frequency band, posing a significant risk of overlap and conflict. Based on this, a gain attenuation factor is constructed, with its value decreasing between 0.65 and 1.0, to weaken the energy output of this frequency band and alleviate spectrum congestion. For neutral frequency bands between the two thresholds, the default gain factor is 1.0 to maintain the original value. The gain amplification factor, attenuation factor, or neutral hold value corresponding to all 64 sub-frequency bands are arranged in order and combined to form a gain factor vector.
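
The band-energy accumulation and classification described above can be sketched as below. The sketch assumes the spectral magnitudes are already available as a one-sided list covering bins 0 to N/2 of an N-point transform; the function names and the `eps` value are hypothetical.

```python
def subband_energies(magnitudes, sample_rate_hz=44100, n_bands=64):
    """Accumulate squared magnitudes into equal-width sub-bands.

    magnitudes: one-sided spectrum magnitudes, bins 0 .. N/2 (assumption).
    """
    nyquist = sample_rate_hz / 2.0
    band_width = nyquist / n_bands  # 344.53125 Hz for 44100 Hz and 64 bands
    energies = [0.0] * n_bands
    for k, mag in enumerate(magnitudes):
        freq = k * nyquist / (len(magnitudes) - 1)
        band = min(int(freq / band_width), n_bands - 1)  # Nyquist bin goes to last band
        energies[band] += mag * mag
    return energies


def energy_difference_index(e1, e2, eps=1e-12):
    """|e1 - e2| / (e1 + e2 + eps); eps guards against a zero denominator."""
    return abs(e1 - e2) / (e1 + e2 + eps)


def classify_band(index, complementary_threshold=0.55, conflict_threshold=0.25):
    """Label a band from its energy difference index."""
    if index > complementary_threshold:
        return "complementary"
    if index < conflict_threshold:
        return "conflict"
    return "neutral"


# Band energies 9.0 and 1.0 give an index near 0.8, a complementary band.
print(classify_band(energy_difference_index(9.0, 1.0)))
```

For a 2048-point transform the one-sided spectrum has 1025 bins, so each sub-band collects 16 bins and the last one also takes the Nyquist bin.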

[0029] In one specific embodiment, the process of constructing a gain amplification coefficient for the energy complementary frequency band, constructing a gain attenuation coefficient for the energy conflict frequency band, and generating a gain coefficient vector based on the gain amplification coefficient and the gain attenuation coefficient can specifically include the following steps: For the energy complementary frequency band, the first normalization coefficient is obtained by subtracting the first preset threshold from the frequency band energy difference index and dividing it by the first preset difference range. The first normalization coefficient is then multiplied by the first preset gain amplitude and added to the first reference gain value to obtain the gain amplification coefficient. For energy conflict frequency bands, the second normalization coefficient is obtained by subtracting the frequency band energy difference index from the second preset threshold and then dividing by the second preset threshold. The gain attenuation coefficient is obtained by subtracting the product of the second normalization coefficient and the second preset attenuation amplitude from the second reference gain value. A gain coefficient vector is constructed based on the gain amplification factor and the gain attenuation factor.

[0030] Specifically, for frequency regions determined to be energy complementary bands, i.e., the part where the energy difference index is greater than the first preset threshold, a linear normalization and gain amplification calculation is executed. The first preset threshold, for example, 0.55, is subtracted from the energy difference index value of the current frequency band, and the difference is divided by the first preset difference range, for example, 0.45, to obtain the first normalization coefficient corresponding to the frequency band, which reflects the relative strength of the frequency band within the complementarity range, and is then used to determine the specific gain amplitude. The first normalization coefficient is multiplied by a first preset gain amplitude parameter (e.g., 0.15), which defines the maximum gain enhancement range that can be applied under the condition of strongest complementarity. The product is then added to a first reference gain value (e.g., 0.95) to obtain the gain amplification factor. The gain amplification factor increases with the enhancement of complementarity, and theoretically can reach a band gain of 1.1 times when complementarity is strongest. For areas determined to be energy conflict bands, i.e., bands where the energy difference index is lower than the second preset threshold, the opposite normalization and attenuation processing is performed: the energy difference index of the current frequency band is subtracted from the second preset threshold (e.g., 0.25), and the difference is divided by the second preset threshold to obtain a second normalization coefficient, which represents the severity of the conflict within the frequency band. This second normalization coefficient is multiplied by a second preset attenuation magnitude (e.g., set to 0.35). The product is then subtracted from the second reference gain value (e.g., set to 1.0) to obtain the gain attenuation coefficient for that frequency band. 
This coefficient decreases as the conflict severity increases, with a minimum attenuation of 0.65 times, to reduce the jarring effect of high-energy overlap and spectral overload in the mix. Neutral frequency bands that do not belong to either the complementary or conflict zones are uniformly assigned a gain value of 1.0, meaning no adjustment is made. The gain amplification factor, attenuation factor, and neutral retention value are organized sequentially according to the corresponding frequency bands to form a gain coefficient vector.
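
A minimal sketch of the gain calculation, using the example values given above (0.55, 0.45, 0.15, 0.95, 0.25, 0.35, 1.0). The function and parameter names are hypothetical, and the sample index list is invented for illustration.

```python
def band_gain(index,
              first_threshold=0.55, first_range=0.45,
              first_amplitude=0.15, first_reference=0.95,
              second_threshold=0.25, second_attenuation=0.35,
              second_reference=1.0):
    """Map an energy difference index to a per-band gain coefficient."""
    if index > first_threshold:
        # Complementary band: normalized distance above the first threshold.
        norm = (index - first_threshold) / first_range
        return first_reference + norm * first_amplitude
    if index < second_threshold:
        # Conflict band: normalized distance below the second threshold.
        norm = (second_threshold - index) / second_threshold
        return second_reference - norm * second_attenuation
    return 1.0  # neutral band, no adjustment


# Gain coefficient vector for 64 hypothetical band indices spread over [0, 1].
indices = [i / 63 for i in range(64)]
gain_vector = [band_gain(d) for d in indices]
print(min(gain_vector), max(gain_vector))
```

With these example values the gain runs from about 0.65 at an index of 0 up to about 1.1 at an index of 1. A band whose index lies just above 0.55 receives a gain close to 0.95, slightly below the neutral value of 1.0, which follows directly from the stated reference gain.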

[0031] In one specific embodiment, the process of performing step 103 may specifically include the following steps: Calculate the cross-correlation function numerical sequence of the first and second original audio, and search for the time delay parameter corresponding to the position of the maximum value of the cross-correlation function numerical sequence as the initial phase difference; The first stationary audio in the first original audio is determined according to the first cyclic stationary identifier, and the corresponding first beat period is obtained. The initial phase difference is subtracted from half of the first beat period, and the result is multiplied by the sampling frequency and rounded to obtain the first time-forward sampling point number. The second stationary audio in the second original audio is determined according to the second cyclic stationary identifier, and the corresponding second beat period is obtained. The initial phase difference is subtracted from half of the second beat period, and the result is multiplied by the sampling frequency and rounded to obtain the second time-forward sampling point number. The first stationary audio is shifted forward by the first time-forward sampling point number to obtain the first compensated audio, and the second stationary audio is shifted forward by the second time-forward sampling point number to obtain the second compensated audio.

[0032] Specifically, a numerical sequence of cross-correlation functions reflecting the temporal correlation between the first and second original audio signals is constructed. Using sample data from both signals as input, a sliding window approach is employed to perform point-by-point multiplication and accumulation calculations between one audio signal and the other at different delays τ, generating a cross-correlation function R(τ) with respect to τ. The value of τ is limited to the rhythmic interlacing range applicable to sleep aid scenarios, for example, 50 milliseconds to 800 milliseconds. The entire delay range is scanned with each sampling point as a step unit to obtain the cross-correlation numerical sequence. The delay position τ_max corresponding to the maximum value in the cross-correlation numerical sequence is searched; this is the maximum temporal overlap point between the two audio signals in their original state, and τ_max is used as the initial phase difference. The first cyclic stationary indicator is read to determine whether the first original audio is a periodic cyclic stationary signal. If it is, the beat period value obtained from the autocorrelation analysis is extracted and recorded as the first beat period. The first beat period value is divided by 2 to obtain the half-cycle length. The initial phase difference is then subtracted from the half-cycle length value, and the result is multiplied by the audio sampling frequency (e.g., 44100Hz) and rounded to the nearest integer to obtain the first time forward sampling point number. This represents the number of samples that need to be advanced in the first stationary audio, thereby achieving time compensation for rhythm misalignment. The second cyclic stationary indicator is read. If the second audio is a periodic cyclic stationary signal, its beat period is obtained as the second beat period. 
The initial phase difference is subtracted from half of the second beat period, and the result is multiplied by the sampling frequency in the same way to obtain the second time forward sampling point number. After obtaining the two time-forward sampling point numbers, sample-level forward processing is performed on the first and second stationary audio in sequence. All samples in each audio data segment after the current position are moved forward as a whole according to the calculated sampling point number. At the same time, the overflowing tail samples during the forward process are discarded or reset to mute to keep the sampling structure length unchanged, and the first compensated audio and the second compensated audio are output respectively.
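
One possible reading of the sample-level forward shift is sketched below. It is an interpretation rather than the definitive procedure: the samples are advanced by the computed count, the samples pushed past the start are dropped, and the vacated positions at the end are filled with silence so the length stays the same. The function name is hypothetical.

```python
def shift_forward(samples, n_shift):
    """Advance the audio by n_shift samples while keeping its length.

    Assumption: the first n_shift samples are discarded and the end is
    padded with zeros (silence).
    """
    if n_shift <= 0:
        return list(samples)  # nothing to advance
    n_shift = min(n_shift, len(samples))
    return list(samples[n_shift:]) + [0.0] * n_shift


print(shift_forward([1.0, 2.0, 3.0, 4.0, 5.0], 2))  # [3.0, 4.0, 5.0, 0.0, 0.0]
```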

[0033] In one specific embodiment, the process of calculating the cross-correlation function numerical sequence of the first and second original audio files, and searching for the time delay parameter corresponding to the maximum value position of the cross-correlation function numerical sequence as the initial phase difference, can specifically include the following steps: The minimum and maximum delay times within the preset delay range are multiplied by the sampling frequency to obtain the lower and upper limits of the delay sampling points. The sampling point search interval is then determined based on the lower and upper limits of the delay sampling points. For each delayed sample point value within the sampling point search interval, the sample points of the first original audio are multiplied one by one with the sample points of the corresponding delayed position of the second original audio, and then summed to obtain the cross-correlation function value corresponding to the delayed sample point value. The cross-correlation function value sequence is obtained by traversing the sampling point search interval. Find the cross-correlation function value with the largest value in the cross-correlation function value sequence, and divide the delayed sampling point value corresponding to the largest cross-correlation function value by the sampling frequency to obtain the initial phase difference.

[0034] Specifically, the preset delay time interval undergoes a sampling point conversion process. The minimum and maximum delay time values within the preset delay range are multiplied by the sampling frequency of the current audio signal, for example, 44100Hz, to obtain the lower and upper limits of the delay sampling points, corresponding to the minimum and maximum sampling offset values of the delay time interval, respectively. Using these two sampling point values as boundaries, a sampling point search interval is constructed, containing all time offset candidate points that may lead to enhanced cross-correlation between audio signals. Cross-correlation function calculation is performed on each delayed sampling point value within the sampling point search interval. Each sampling point within the current frame range of the first original audio is multiplied one-to-one with the sample value at the corresponding delayed sampling point position in the second original audio, and all product results are summed in sample index order to obtain the cross-correlation function value corresponding to the specific delayed sampling point value. This process is repeated for each sampling point offset within the entire search interval, generating a cross-correlation function value sequence. The horizontal axis of the cross-correlation function value sequence represents the number of delayed sampling points, and the vertical axis represents the corresponding cross-correlation strength. A maximum search operation is performed in the cross-correlation function numerical sequence to find the function term with the largest value, and its corresponding delayed sample point value is recorded. This point is the optimal alignment position between the two audio streams to achieve maximum similarity or maximum temporal overlap in the current data segment. 
The delayed sample point value corresponding to the maximum cross-correlation function value is divided by the sampling frequency to complete the conversion from sampling point units to time units, thus obtaining the initial phase difference.
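
The delay search described above can be sketched as a direct scan over the sample-domain search interval. The function name is hypothetical, and rounding the interval bounds to whole samples is an assumption. The demonstration uses a 100Hz sampling frequency and two impulses only to keep it small.

```python
def initial_phase_difference(first, second, sample_rate_hz,
                             min_delay_s=0.05, max_delay_s=0.8):
    """Return the delay (seconds) that maximizes the cross-correlation.

    R(lag) = sum_i first[i] * second[i + lag], scanned one sample at a time.
    """
    lower = int(round(min_delay_s * sample_rate_hz))  # 2205 at 44100 Hz
    upper = int(round(max_delay_s * sample_rate_hz))  # 35280 at 44100 Hz
    best_lag, best_val = lower, float("-inf")
    for lag in range(lower, upper + 1):
        n = min(len(first), len(second) - lag)
        if n <= 0:
            break  # the delayed signal no longer overlaps the first one
        r = sum(first[i] * second[i + lag] for i in range(n))
        if r > best_val:
            best_lag, best_val = lag, r
    return best_lag / sample_rate_hz


# Toy signals: an impulse at sample 10 and the same impulse 30 samples later.
first = [0.0] * 200
second = [0.0] * 200
first[10] = 1.0
second[40] = 1.0
print(initial_phase_difference(first, second, sample_rate_hz=100))
```

At 44100Hz the interval spans more than 33000 candidate delays, so a direct scan over long signals is costly. Computing the correlation through a Fourier transform is a common alternative, although the text above describes only the direct form.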

[0035] In one specific embodiment, the process of performing step 104 may specifically include the following steps: The first non-stationary audio in the first original audio is determined according to the first cyclic stationary identifier, and the second non-stationary audio in the second original audio is determined according to the second cyclic stationary identifier. Assign a first basic weighting coefficient to the first non-stationary audio and the second non-stationary audio, assign a second basic weighting coefficient to the first stationary audio and the second stationary audio, and introduce a sinusoidal time-varying modulation function to construct the first time-varying weighting coefficient and the second time-varying weighting coefficient. The first compensated audio is subjected to Fourier transform to obtain the first frequency domain data, and the second compensated audio is subjected to Fourier transform to obtain the second frequency domain data. The first weighted frequency domain data is obtained by multiplying the first time-varying weight coefficient, the first frequency domain data, and the gain coefficient vector at each frequency point; the second weighted frequency domain data is obtained by multiplying the second time-varying weight coefficient, the second frequency domain data, and the gain coefficient vector at each frequency point. The first weighted frequency domain data and the second weighted frequency domain data are summed one frequency point at a time to obtain the mixed frequency domain data, and the mixed frequency domain data are subjected to inverse Fourier transform to obtain the time domain mixed signal. A linear attenuation gain is applied to the time-domain mixed signal during the sleep induction stage, the light sleep maintenance stage, and the deep sleep stage, respectively, and the target audio signal is output.

[0036] Specifically, the presence of a significantly stable rhythmic structure in the first original audio is determined based on the first cyclic stationarity indicator. If the indicator value is 1, it indicates that the audio is a periodic cyclic stationary signal. Based on this, the remaining non-rhythmic parts that did not participate in rhythm compensation processing are extracted as the first non-stationary audio. The portion of the second original audio that lacks a periodic structure is defined as the second non-stationary audio based on the second cyclic stationarity indicator. A relatively high base weight coefficient α_base, for example, 0.58, is assigned to these two types of non-stationary audio. A lower base weight coefficient β_base, for example, 0.42, is assigned to the first and second stationary audio, i.e., the rhythmic signal segments generated after rhythm offset compensation. This ensures that natural background sounds dominate the overall mix while preserving the secondary accompanying perception of periodic rhythmic sounds. A sinusoidal time-varying modulation mechanism is introduced, constructing two weight functions that vary with playback time. The first time-varying weight coefficient is defined as α(t) = α_base + 0.08·sin(2πt / T_mod), and the second time-varying weight coefficient is defined as β(t) = β_base - 0.08·sin(2πt / T_mod), where t represents the playback time and T_mod is the modulation period (120 seconds). This modulation method causes the loudness of the two audio tracks to alternately increase and decrease periodically, meeting the sensitivity adjustment requirements for sound layer changes during sleep. After completing the dynamic modeling of the weighting coefficients, a 2048-point Fast Fourier Transform is performed on the first and second compensated audio tracks respectively to obtain the first and second frequency domain data in frequency domain representation. 
The first time-varying weighting coefficient is multiplied by the complex amplitude of each frequency point in the first frequency domain data, and then multiplied point by point by the gain coefficient vector to complete the spectral weighting processing of the first audio track, resulting in the first weighted frequency domain data. The second audio track is processed in the same way to obtain the second weighted frequency domain data. The two sets of frequency domain data are added point by point according to frequency points to synthesize the mixed frequency domain data, and then an inverse Fast Fourier Transform is performed to restore it to the time domain signal form, resulting in the time domain mixed signal. The playback process is divided into three consecutive stages: the sleep induction stage (0–20 minutes), the light sleep maintenance stage (20–40 minutes), and the deep sleep stabilization stage (after 40 minutes). An independent linear decay gain function is designed for each stage, with the initial gain value G0 linearly decreased by 12% and 18% respectively until a final value of 0.72G0 is reached, then maintained at a constant value, forming a three-segment sleep-inducing curve. The total gain function G(t) at the corresponding time t is applied to the time-domain mixed signal to achieve dynamic physiological rhythm adaptation control of the volume, outputting a target audio signal that meets sleep-inducing requirements.
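
The time-varying weights can be illustrated with the short sketch below. It assumes the second weight is modulated with the opposite sign to the first, which is consistent with the stated ranges (0.50 to 0.66 and 0.34 to 0.50) and with the two tracks rising and falling alternately. The function name is hypothetical.

```python
import math


def time_varying_weights(t_s, alpha_base=0.58, beta_base=0.42,
                         depth=0.08, period_s=120.0):
    """Return (alpha(t), beta(t)) for playback time t_s in seconds."""
    modulation = depth * math.sin(2.0 * math.pi * t_s / period_s)
    # Opposite signs keep alpha(t) + beta(t) equal to alpha_base + beta_base.
    return alpha_base + modulation, beta_base - modulation


for t in (0.0, 30.0, 60.0, 90.0):
    alpha, beta = time_varying_weights(t)
    print(t, round(alpha, 3), round(beta, 3))
```

Under this assumption the first weight peaks near 0.66 at 30 seconds, where the second reaches about 0.34, and both meet at about 0.50 at 90 seconds.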

[0037] In one specific embodiment, the process of performing the step of multiplying the first time-varying weighting coefficient, the first frequency domain data, and the gain coefficient vector frequency-by-frequency to obtain the first weighted frequency domain data, and multiplying the second time-varying weighting coefficient, the second frequency domain data, and the gain coefficient vector frequency-by-frequency to obtain the second weighted frequency domain data, can specifically include the following steps: For each first frequency point in the first frequency domain data, the first time-varying weighting coefficient is multiplied by the spectral amplitude of the first frequency point and then multiplied by the gain coefficient corresponding to the sub-frequency band to which the first frequency point belongs in the gain coefficient vector to obtain the first weighted frequency domain amplitude of the first frequency point. The first weighted frequency domain data is obtained by traversing all first frequency points. For each second frequency point in the second frequency domain data, the second time-varying weighting coefficient is multiplied by the spectral amplitude of the second frequency point, and then multiplied by the gain coefficient corresponding to the sub-frequency band to which the second frequency point belongs in the gain coefficient vector to obtain the second weighted frequency domain amplitude of the second frequency point. The second weighted frequency domain data is obtained by traversing all the second frequency points. The weighted frequency domain amplitudes of the first and second weighted frequency domain data are added together one by one to obtain the mixed frequency domain data.

[0038] Specifically, the first frequency domain data corresponding to the first compensated audio is represented as a set of frequency points, each containing a complex amplitude value reflecting the amplitude intensity and phase information of that point in the current spectrum. Each first frequency point in the set is processed one by one to extract the first time-varying weight coefficient corresponding to the current playback time. The first time-varying weight coefficient originates from a dynamic modulation model constructed based on a sine function. The time-varying weight coefficient is multiplied by the spectral amplitude of the current first frequency point to achieve time weight control of the frequency point. Based on this, the frequency band number to which the frequency point belongs is identified, that is, it is determined which segment of the 64 sub-bands its frequency value belongs to. The corresponding gain value in the gain coefficient vector is indexed according to the sub-band number, and the aforementioned product result is multiplied by the gain coefficient to obtain the first weighted frequency domain amplitude of the current first frequency point. Following this method, all frequency points are traversed, and weighting with both weight and gain is performed sequentially to construct the first weighted frequency domain data set. For the second frequency domain data of the second compensated audio, the second time-varying weight coefficient is extracted point by point. This coefficient is then multiplied by the spectral amplitude of the current frequency point and the gain coefficient of the corresponding sub-band to form the second weighted frequency domain amplitude of the second frequency point. This second weighted frequency domain data is then constructed by traversing all frequency points. 
The first and second weighted frequency domain data are then added point-to-point, and the two weighted frequency domain amplitudes at each frequency point are subjected to complex addition to obtain the mixed frequency domain data amplitude for the corresponding frequency point. This operation is repeated until all frequency points have been processed, forming a mixed frequency domain data set.
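
The per-frequency-point weighting and complex addition can be sketched as follows. The spectra are assumed to be one-sided lists of complex values for bins 0 to N/2 of a 2048-point transform, and the function names are hypothetical.

```python
def band_index(k, n_fft=2048, n_bands=64):
    """Map frequency bin k (0 .. n_fft/2) to its sub-band number."""
    bins_per_band = (n_fft // 2) / n_bands  # 16 bins per band here
    return min(int(k / bins_per_band), n_bands - 1)


def mix_spectra(first_spec, second_spec, alpha, beta, gain_vector, n_fft=2048):
    """Weight both spectra per bin and add them as complex numbers."""
    mixed = []
    for k, (x1, x2) in enumerate(zip(first_spec, second_spec)):
        gain = gain_vector[band_index(k, n_fft, len(gain_vector))]
        weighted_first = alpha * gain * x1
        weighted_second = beta * gain * x2
        mixed.append(weighted_first + weighted_second)
    return mixed


# Flat toy spectra; only the lowest sub-band is attenuated.
gains = [1.0] * 64
gains[0] = 0.5
mixed = mix_spectra([1 + 1j] * 1025, [2 - 1j] * 1025, 0.6, 0.4, gains)
print(mixed[0], mixed[16])
```

Multiplying a complex bin by a real weight and a real gain scales its magnitude and leaves its phase unchanged, so phase differences between the two tracks only take effect in the final complex addition.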

[0039] In one specific embodiment, the process of applying a linear attenuation gain to the time-domain mixed signal during the sleep induction stage, the light sleep maintenance stage, and the deep sleep stage, and outputting the target audio signal, can specifically include the following steps: The sleep stage is determined based on the playback time. When the playback time is between the first and second time, it is determined to be the sleep induction stage. When the playback time is between the second and third time, it is determined to be the light sleep maintenance stage. When the playback time exceeds the third time, it is determined to be the deep sleep stage. During the sleep induction phase, the first-stage gain coefficient is obtained by multiplying the initial gain value by one minus the product of the first attenuation ratio and the ratio of the playback time to the duration of the first stage. During the light sleep maintenance phase, the second-stage gain coefficient is obtained by multiplying the gain value at the end of the sleep induction phase by one minus the product of the second attenuation ratio and the ratio of the playback time offset to the duration of the second stage. During the deep sleep phase, a constant third-stage gain coefficient is used. The time-domain mixed signal is subjected to Fourier transform and low-frequency band enhancement to obtain low-frequency enhanced frequency domain data. The stage gain coefficient corresponding to the current sleep stage is multiplied with the low-frequency enhanced frequency domain data and then subjected to inverse Fourier transform to output the target audio signal.
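
The staged gain can be sketched as a piecewise function of playback time. The sketch assumes playback starts at 0 seconds and uses the example figures given earlier in the description: stage boundaries at 20 and 40 minutes and attenuation ratios of 12% and 18%. The function and parameter names are hypothetical.

```python
def stage_gain(t_s, g0=1.0, t1_s=1200.0, t2_s=2400.0, lam1=0.12, lam2=0.18):
    """Gain at playback time t_s (seconds) for the three sleep stages."""
    if t_s <= t1_s:
        # Sleep induction: linear decrease from g0 to g0 * (1 - lam1).
        return g0 * (1.0 - (t_s / t1_s) * lam1)
    g1_end = g0 * (1.0 - lam1)
    if t_s <= t2_s:
        # Light sleep maintenance: further linear decrease from g1_end.
        return g1_end * (1.0 - ((t_s - t1_s) / (t2_s - t1_s)) * lam2)
    # Deep sleep: hold the value reached at the end of the second stage.
    return g1_end * (1.0 - lam2)


for minutes in (0, 10, 20, 30, 40, 60):
    print(minutes, round(stage_gain(minutes * 60.0), 4))
```

With these values the gain ends at 0.88 × 0.82 = 0.7216 of the initial value, which agrees with the roughly 72% level stated for the deep sleep stage.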

[0040] Specifically, a management structure for playback time is established, and the audio playback process is divided into three continuous physiological rhythm stages based on playback time: the sleep induction stage, the light sleep maintenance stage, and the deep sleep stage. The first time T1 and the second time T2 are set as key time points for stage division, where T1 represents the time boundary from the start of playback to the end of the sleep induction stage, and T2 represents the end time of the light sleep maintenance stage. When the current playback time t does not exceed T1 (t ∈ [0, T1]), it is determined to be in the sleep induction stage; when the playback time is between T1 and T2 (t ∈ (T1, T2]), it is determined to be in the light sleep maintenance stage; and when the playback time exceeds T2 (t > T2), it enters the deep sleep stage. Based on the stage determination, different gain function strategies are designed for each stage to adapt to the changing patterns of human auditory sensitivity. In the sleep induction stage, a linearly decreasing function G1(t) = G0·[1 − (t / T1)·λ1] is constructed based on the set initial gain value G0, where λ1 is the attenuation ratio of the first stage, representing the maximum attenuation amplitude within the time range of T1; in the light sleep maintenance stage, taking the gain value G1(T1) at the end of the sleep induction stage as the starting point, a new linear attenuation function G2(t) = G1(T1)·[1 − ((t − T1) / (T2 − T1))·λ2] is constructed, where λ2 is the attenuation ratio of the second stage. This function controls the volume to decrease further in the second stage, reaching the lowest controllable value at T2. After entering the deep sleep stage, the constant third-stage gain coefficient G3 = G2(T2) is maintained to keep the output volume stable and avoid disturbing deep sleep. A fast Fourier transform is performed on the time-domain mixed signal to obtain the frequency-domain signal.
Low-frequency enhancement processing is then performed on this frequency-domain signal. By constructing a frequency enhancement function H_low(f) with a Gaussian shape centered at 150 Hz, selective amplification of frequency components within the range of 150Hz± is achieved. The function form is H_low(f) = 1 + η·exp[−(f − f0)² / σ²], where η is the peak enhancement amplitude, f0 is the enhancement center frequency, and σ² controls the enhancement width. The enhancement function is multiplied frequency-by-frequency by the frequency-domain signal to obtain the frequency-domain data after low-frequency enhancement. The stage gain coefficient corresponding to the current playback time stage is multiplied frequency-by-frequency by the frequency-domain data after low-frequency enhancement, and an inverse fast Fourier transform is performed to recover the target audio signal after physiological rhythm modulation and low-frequency enhancement processing.
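The low-frequency enhancement and stage-gain application can be sketched as follows, reading the enhancement function as a Gaussian bump of height η centered at f0 added to unity. For self-containment the sketch uses a direct O(N²) discrete Fourier transform in place of the fast Fourier transform; the results are the same, only slower. The values of `eta` and `sigma`, the sampling rate, and all function names are assumptions; only the 150 Hz center frequency comes from the description.

```python
import cmath
import math


def dft(x):
    """Direct discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
            for k in range(n)]


def idft(spectrum):
    """Direct inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * m / n) for k in range(n)) / n
            for m in range(n)]


def h_low(f, eta, f0, sigma):
    """Gaussian-shaped enhancement: 1 + eta * exp(-(f - f0)^2 / sigma^2)."""
    return 1.0 + eta * math.exp(-((f - f0) ** 2) / sigma ** 2)


def enhance_and_scale(signal, fs, gain, eta=0.5, f0=150.0, sigma=50.0):
    """Boost the band around f0, apply the stage gain, return a real signal."""
    n = len(signal)
    shaped = []
    for k, value in enumerate(dft(signal)):
        # Mirrored bins k and n - k share one physical frequency, which keeps
        # the spectrum conjugate-symmetric and the output real.
        f = min(k, n - k) * fs / n
        shaped.append(gain * h_low(f, eta, f0, sigma) * value)
    return [v.real for v in idft(shaped)]


# Hypothetical test tone: 150 Hz sine, 64 samples at 1200 Hz (exactly 8 cycles).
fs, n = 1200.0, 64
tone = [math.sin(2 * math.pi * 150.0 * m / fs) for m in range(n)]
out = enhance_and_scale(tone, fs, gain=0.5)
print(max(out))
```

A tone at the center frequency is scaled by gain × (1 + η), while a tone far from f0 is scaled by the stage gain alone, which is the selective amplification the paragraph describes.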

[0041] The above describes the sleep-aid mixing control method for intelligent superposition of dual audio sources in the embodiments of the present invention. The following describes the sleep-aid mixing control device for intelligent superposition of dual audio sources in the embodiments of the present invention. Referring to Figure 2, one embodiment of the sleep-aid mixing control device for intelligent superposition of dual audio sources in this invention includes: The acquisition module 201 is used to acquire the first cyclic stationary identifier of the first original audio and the second cyclic stationary identifier of the second original audio; The calculation module 202 is used to calculate the first energy distribution value of each first sub-frequency band in the first original audio, calculate the second energy distribution value of each second sub-frequency band in the second original audio, and determine the gain coefficient vector based on the first energy distribution value and the second energy distribution value; The compensation module 203 is used to determine the first stationary audio in the first original audio according to the first cyclic stationary identifier and perform time-forward compensation to obtain the first compensated audio, and to determine the second stationary audio in the second original audio according to the second cyclic stationary identifier and perform time-forward compensation to obtain the second compensated audio; The output module 204 is used to mix the first compensated audio and the second compensated audio by applying the gain coefficient vector to output the target audio signal.
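The data flow between modules 201 to 204 can be pictured with a small structural sketch. The class name, the callable signatures, and the stub modules in the example are hypothetical; they show only the order in which the four modules hand data to one another, not how each module is implemented.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence, Tuple

Audio = Sequence[float]


@dataclass
class MixingControlDevice:
    acquire: Callable[[Audio, Audio], Tuple[Any, Any]]                    # module 201
    calculate: Callable[[Audio, Audio], Sequence[float]]                  # module 202
    compensate: Callable[[Audio, Audio, Any, Any], Tuple[Audio, Audio]]   # module 203
    output: Callable[[Audio, Audio, Sequence[float]], Audio]              # module 204

    def run(self, first: Audio, second: Audio) -> Audio:
        id1, id2 = self.acquire(first, second)             # cyclic stationary identifiers
        gains = self.calculate(first, second)              # gain coefficient vector
        comp1, comp2 = self.compensate(first, second, id1, id2)
        return self.output(comp1, comp2, gains)            # target audio signal


# Stub modules (placeholders only) used to exercise the wiring.
device = MixingControlDevice(
    acquire=lambda a, b: ("periodic", "periodic"),
    calculate=lambda a, b: [1.0],
    compensate=lambda a, b, i1, i2: (a, b),
    output=lambda a, b, g: [g[0] * (x + y) for x, y in zip(a, b)],
)
result = device.run([1.0, 2.0], [0.5, 0.5])
print(result)
```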

[0042] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the systems and units described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here.

[0043] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0044] The above-described embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A sleep-aid mixing control method for intelligent superposition of dual audio sources, characterized in that it comprises: Obtaining the first cyclic stationary identifier of the first original audio and the second cyclic stationary identifier of the second original audio; Calculating the first energy distribution value of each first sub-band in the first original audio, calculating the second energy distribution value of each second sub-band in the second original audio, and determining the gain coefficient vector based on the first energy distribution value and the second energy distribution value; Determining the first stationary audio in the first original audio according to the first cyclic stationary identifier and performing time-forward compensation to obtain the first compensated audio, and determining the second stationary audio in the second original audio according to the second cyclic stationary identifier and performing time-forward compensation to obtain the second compensated audio; Mixing the first compensated audio and the second compensated audio using the gain coefficient vector to output the target audio signal.

2. The sleep-aid mixing control method for intelligent superposition of dual audio sources according to claim 1, characterized in that, The process of obtaining the first cyclic stationary identifier of the first original audio and the second cyclic stationary identifier of the second original audio includes: Calculate the first short-time energy envelope function of the first original audio and the second short-time energy envelope function of the second original audio; Autocorrelation analysis is performed on the first short-time energy envelope function to obtain the first envelope periodicity index, and autocorrelation analysis is performed on the second short-time energy envelope function to obtain the second envelope periodicity index. Based on the first envelope periodicity index, determine whether the first original audio is a periodic or non-periodic cyclic stationary signal and record the first cyclic stationary identifier; based on the second envelope periodicity index, determine whether the second original audio is a periodic or non-periodic cyclic stationary signal and record the second cyclic stationary identifier.

3. The sleep-aid mixing control method for intelligent superposition of dual audio sources according to claim 1, characterized in that, The steps of calculating the first energy distribution value of each first sub-frequency band in the first original audio, calculating the second energy distribution value of each second sub-frequency band in the second original audio, and determining the gain coefficient vector based on the first energy distribution value and the second energy distribution value include: The first original audio is divided into multiple first sub-bands and the squares of the spectral amplitudes of the frequency points in each first sub-band are summed to obtain a first energy distribution value. The second original audio is divided into multiple second sub-bands and the squares of the spectral amplitudes of the frequency points in each second sub-band are summed to obtain a second energy distribution value. For each sub-band, the frequency band energy difference index is calculated from the corresponding first energy distribution value and second energy distribution value; The energy complementary frequency band is determined based on the frequency band energy difference index and the first preset threshold, and the energy conflict frequency band is determined based on the frequency band energy difference index and the second preset threshold. A gain amplification factor is constructed for the energy complementary frequency band, a gain attenuation factor is constructed for the energy conflict frequency band, and a gain coefficient vector is generated based on the gain amplification factor and the gain attenuation factor.

4. The sleep-aid mixing control method for intelligent superposition of dual audio sources according to claim 3, characterized in that, The step of constructing a gain amplification factor for the energy complementary frequency band, constructing a gain attenuation factor for the energy conflict frequency band, and generating a gain coefficient vector based on the gain amplification factor and the gain attenuation factor includes: For the energy complementary frequency band, the first normalization coefficient is obtained by subtracting the first preset threshold from the frequency band energy difference index and dividing it by the first preset difference range. The first normalization coefficient is then multiplied by the first preset gain amplitude and added to the first reference gain value to obtain the gain amplification coefficient. For the energy conflict frequency band, the second normalization coefficient is obtained by subtracting the frequency band energy difference index from the second preset threshold and then dividing by the second preset threshold. The gain attenuation coefficient is obtained by subtracting the product of the second normalization coefficient and the second preset attenuation amplitude from the second reference gain value. A gain coefficient vector is constructed based on the gain amplification factor and the gain attenuation factor.

5. The sleep-aid mixing control method for intelligent superposition of dual audio sources according to claim 1, characterized in that, The step of determining the first stationary audio in the first original audio based on the first cyclic stationary identifier and performing time-forward compensation to obtain the first compensated audio, and determining the second stationary audio in the second original audio based on the second cyclic stationary identifier and performing time-forward compensation to obtain the second compensated audio, includes: Calculate the cross-correlation function numerical sequence of the first original audio and the second original audio, and search for the time delay parameter corresponding to the position of the maximum value of the cross-correlation function numerical sequence as the initial phase difference; Based on the first cyclic stationary identifier, the first stationary audio in the first original audio is determined and the corresponding first beat period is obtained. Half of the first beat period is subtracted from the initial phase difference, and the result is multiplied by the sampling frequency and rounded to obtain the first time forward sampling point number. Based on the second cyclic stationary identifier, the second stationary audio in the second original audio is determined and the corresponding second beat period is obtained. Half of the second beat period is subtracted from the initial phase difference, and the result is multiplied by the sampling frequency and rounded to obtain the second time forward sampling point number. The first stationary audio is shifted forward by the first time forward sampling point number to obtain the first compensated audio, and the second stationary audio is shifted forward by the second time forward sampling point number to obtain the second compensated audio.

6. The sleep-aid mixing control method for intelligent superposition of dual audio sources according to claim 5, characterized in that, The step of calculating the cross-correlation function numerical sequence of the first original audio and the second original audio, and searching for the time delay parameter corresponding to the maximum value position of the cross-correlation function numerical sequence as the initial phase difference, includes: The minimum and maximum delay times within the preset delay range are multiplied by the sampling frequency to obtain the lower limit and upper limit of the delay sampling points, respectively. The sampling point search interval is then determined based on the lower and upper limits of the delay sampling points. For each delayed sampling point value within the sampling point search interval, the sampling points of the first original audio are multiplied one by one with the sampling points at the corresponding delayed positions of the second original audio, and the products are summed to obtain the cross-correlation function value corresponding to that delayed sampling point value. The cross-correlation function value sequence is obtained by traversing the sampling point search interval. Find the largest cross-correlation function value in the cross-correlation function value sequence, and divide the delayed sampling point value corresponding to the largest cross-correlation function value by the sampling frequency to obtain the initial phase difference.

7. The sleep-aid mixing control method for intelligent superposition of dual audio sources according to claim 6, characterized in that, The process of applying the gain coefficient vector to mix the first compensated audio and the second compensated audio to output a target audio signal includes: The first non-stationary audio in the first original audio is determined according to the first cyclic stationary identifier, and the second non-stationary audio in the second original audio is determined according to the second cyclic stationary identifier. Assign a first basic weight coefficient to the first non-stationary audio and the second non-stationary audio, assign a second basic weight coefficient to the first stationary audio and the second stationary audio, and introduce a sinusoidal time-varying modulation function to construct the first time-varying weight coefficient and the second time-varying weight coefficient. Perform a Fourier transform on the first compensated audio to obtain first frequency domain data, and perform a Fourier transform on the second compensated audio to obtain second frequency domain data; The first weighted frequency domain data is obtained by multiplying the first time-varying weight coefficient, the first frequency domain data, and the gain coefficient vector point by point; the second weighted frequency domain data is obtained by multiplying the second time-varying weight coefficient, the second frequency domain data, and the gain coefficient vector point by point. The first weighted frequency domain data and the second weighted frequency domain data are summed point by point to obtain mixed frequency domain data, and the mixed frequency domain data are subjected to inverse Fourier transform to obtain a time-domain mixed signal. 
A linear attenuation gain is applied to the time-domain mixed signal during the sleep induction stage, the light sleep maintenance stage, and the deep sleep stage, respectively, and the target audio signal is output.

8. The sleep-aid mixing control method for intelligent superposition of dual audio sources according to claim 7, characterized in that, The step of multiplying the first time-varying weighting coefficient, the first frequency domain data, and the gain coefficient vector frequency-by-frequency to obtain the first weighted frequency domain data, and multiplying the second time-varying weighting coefficient, the second frequency domain data, and the gain coefficient vector frequency-by-frequency to obtain the second weighted frequency domain data, includes: For each first frequency point in the first frequency domain data, the first time-varying weighting coefficient is multiplied by the spectral amplitude of the first frequency point and then multiplied by the gain coefficient corresponding to the sub-frequency band to which the first frequency point belongs in the gain coefficient vector to obtain the first weighted frequency domain amplitude of the first frequency point. The first weighted frequency domain data is obtained by traversing all first frequency points. For each second frequency point in the second frequency domain data, the second time-varying weighting coefficient is multiplied by the spectral amplitude of the second frequency point and then multiplied by the gain coefficient corresponding to the sub-frequency band to which the second frequency point belongs in the gain coefficient vector to obtain the second weighted frequency domain amplitude of the second frequency point. The second weighted frequency domain data is obtained by traversing all second frequency points. The weighted frequency domain amplitudes of the first weighted frequency domain data and the second weighted frequency domain data are added one by one to obtain the mixed frequency domain data.

9. The sleep-aid mixing control method for intelligent superposition of dual audio sources according to claim 7, characterized in that, The step of applying a linear attenuation gain to the time-domain mixed signal during the sleep induction stage, light sleep maintenance stage, and deep sleep stage, respectively, and outputting a target audio signal includes: The sleep stage is determined based on the playback time. When the playback time is between the first and second time, it is determined to be the sleep induction stage. When the playback time is between the second and third time, it is determined to be the light sleep maintenance stage. When the playback time exceeds the third time, it is determined to be the deep sleep stage. During the sleep induction stage, the first-stage gain coefficient is obtained by multiplying the initial gain value by one minus the product of the first attenuation ratio and the ratio of the playback time to the duration of the first stage. During the light sleep maintenance stage, the second-stage gain coefficient is obtained by multiplying the gain value at the end of the sleep induction stage by one minus the product of the second attenuation ratio and the ratio of the playback time offset to the duration of the second stage. During the deep sleep stage, a constant third-stage gain coefficient is used. The time-domain mixed signal is subjected to Fourier transform and low-frequency band enhancement to obtain low-frequency enhanced frequency domain data. The stage gain coefficient corresponding to the current sleep stage is multiplied with the low-frequency enhanced frequency domain data and then subjected to inverse Fourier transform to output the target audio signal.

10. A sleep-aid mixing control device for intelligent superposition of dual audio sources, characterized in that, The device is used to perform the sleep-aid mixing control method for intelligent superposition of dual audio sources as described in any one of claims 1-9, and comprises: The acquisition module, used to acquire the first cyclic stationary identifier of the first original audio and the second cyclic stationary identifier of the second original audio; The calculation module, used to calculate the first energy distribution value of each first sub-frequency band in the first original audio, calculate the second energy distribution value of each second sub-frequency band in the second original audio, and determine the gain coefficient vector based on the first energy distribution value and the second energy distribution value; The compensation module, used to determine the first stationary audio in the first original audio according to the first cyclic stationary identifier and perform time-forward compensation to obtain the first compensated audio, and to determine the second stationary audio in the second original audio according to the second cyclic stationary identifier and perform time-forward compensation to obtain the second compensated audio; The output module, used to mix the first compensated audio and the second compensated audio using the gain coefficient vector, and output the target audio signal.
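By way of illustration only, the delay search recited in claim 6 and the forward-shift computation recited in claim 5 can be sketched as below. The function names, the impulse test signals, and the numeric values are assumptions. The claims do not say how a negative sampling point number is handled or how the vacated samples are filled; the sketch treats a negative value as a delay and pads with zeros, which is one possible reading.

```python
def cross_correlation_delay(x1, x2, fs, min_delay, max_delay):
    """Search the lag range [min_delay, max_delay] (seconds) for the lag that
    maximizes sum(x1[n] * x2[n + lag]) and return it in seconds."""
    lo = round(min_delay * fs)  # lower limit of delay sampling points
    hi = round(max_delay * fs)  # upper limit of delay sampling points
    best_lag, best_value = lo, float("-inf")
    for lag in range(lo, hi + 1):
        value = sum(x1[n] * x2[n + lag]
                    for n in range(len(x1)) if 0 <= n + lag < len(x2))
        if value > best_value:
            best_lag, best_value = lag, value
    return best_lag / fs


def forward_samples(phase_diff, beat_period, fs):
    """(initial phase difference - half the beat period) * fs, rounded."""
    return round((phase_diff - beat_period / 2.0) * fs)


def shift_forward(audio, n):
    """Advance audio by n samples, zero-padding so the length is unchanged.
    Negative n delays the audio instead (an assumption, see lead-in)."""
    if n >= 0:
        return list(audio[n:]) + [0.0] * min(n, len(audio))
    return [0.0] * min(-n, len(audio)) + list(audio[:n])


# Hypothetical signals: unit impulses three samples apart, sampled at 100 Hz.
fs = 100.0
x1 = [0.0] * 32
x2 = [0.0] * 32
x1[10] = 1.0
x2[13] = 1.0
delay = cross_correlation_delay(x1, x2, fs, min_delay=-0.05, max_delay=0.05)
samples = forward_samples(delay, beat_period=0.04, fs=fs)
print(delay, samples, shift_forward([1.0, 2.0, 3.0, 4.0], samples))
```

Restricting the search to the preset delay range keeps the cost proportional to the number of candidate lags rather than to the full signal length squared.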