Song timbre matching method and system based on audio spectrum analysis

By using audio spectrum analysis to dynamically segment and adaptively match spectrum segments, the problem of timbre distortion caused by user singing rhythm deviations is solved, achieving accurate timbre matching and rhythm restoration while preserving the original singing style and user timbre characteristics.

CN121506191B · Active · Publication Date: 2026-06-26 · CHENGDU YINYUE CHUANGXIANG TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CHENGDU YINYUE CHUANGXIANG TECH CO LTD
Filing Date
2025-12-19
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing timbre optimization methods cause distortion of sound physical characteristics and inaccurate time-domain anchoring of spectral energy peaks when the user's singing rhythm deviates from the original singer's, and cannot effectively preserve the original singer's stylistic imprint.

Method used

By using an audio spectrum analysis method, a matching spectrum segment covering the complete articulation cycle is dynamically defined. The spectrum segment is divided into front and back segments by using the point of maximum vibration energy in the fundamental frequency region. The time value difference is quantified and combined with a preset threshold for adaptive matching, and correction is triggered only when the error exceeds the tolerance.

Benefits of technology

It achieves accurate matching of user timbre and original vocals even with rhythm deviations, avoids fluctuations and distortions introduced by phrase scaling, preserves the original vocal style and user timbre individuality, and optimizes the characteristics of onset impact and tail decay.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a song timbre matching method and system based on audio spectrum analysis, and relates to the technical field of data processing. The method comprises the following steps: obtaining original vocal audio and user audio; obtaining song phrases; obtaining a first matching spectrum segment and a second matching spectrum segment corresponding to each song phrase; obtaining a first spectral peak point and dividing the first matching spectrum segment into a first front spectrum segment and a first rear spectrum segment; obtaining a second spectral peak point and dividing the second matching spectrum segment into a second front spectrum segment and a second rear spectrum segment; obtaining a front spectrum difference and a rear spectrum difference, and judging whether both are smaller than a preset threshold; if not, generating a target front spectrum segment and a target rear spectrum segment, constructing target sentence audio from them, and obtaining matched audio from the target sentence audio. The application offers good timbre matching, repairs rhythm deviations, and preserves both the original singing style and the user's timbre.

Description

Technical Field

[0001] This invention relates to the field of data processing technology, and specifically to a method and system for matching song timbre based on audio spectrum analysis.

Background Technology

[0002] In scenarios such as KTV systems, vocal teaching, and karaoke equipment, timbre optimization corrects users' singing rhythm deviations, simultaneously optimizes articulation and the vibrato at the end of notes, and matches the dynamic characteristics of the demonstration recording. It can solve problems such as amateur singers rushing the intro or dragging the chorus. Its core function is to optimize singing timbre without modeling. However, existing timbre optimization methods have problems in matching and processing the user's singing timbre.

[0003] Specifically, when the user's singing rhythm deviates from the original (e.g., inconsistent articulation speed or different durations of the final note), existing timbre transfer methods directly stretch or compress the entire phrase's audio in the global temporal domain, resulting in distortion of the sound's physical characteristics. Furthermore, forcibly adjusting the duration of the entire phrase disrupts the temporal structure of acoustic events, causing a misalignment between the onset impact phase and the attenuation trajectory of the final note. For example, if the user starts singing quickly but the original singer has a slow start style, compression processing will weaken the glottal pulse intensity, producing "breathiness" distortion. Furthermore, the unified spectrum scaling process ignores the temporal anchoring effect of the spectral energy peaks, causing the formant migration trajectory to break. A typical phenomenon is that the fundamental frequency envelope is stretched and the oscillation frequency becomes inaccurate during vibrato singing, forming a mechanical sine wave. Furthermore, to avoid the above problems, existing solutions often only process the steady-state region, discarding the articulation characteristics of the beginning and end stages, resulting in the optimized singing losing the original singer's stylistic imprint.

Summary of the Invention

[0004] To address the technical problem that existing technologies lack the ability to decouple audio physiological temporal structure and that there is a contradiction between rhythm correction and timbre preservation, this invention provides a song timbre matching method and system based on audio spectrum analysis.

[0005] A song timbre matching method based on audio spectrum analysis includes: acquiring the original vocal audio and user audio of a target song, acquiring multiple song phrases corresponding to the target song, acquiring a first matching spectrum segment corresponding to each song phrase based on the original vocal audio, and acquiring a second matching spectrum segment corresponding to each song phrase based on the user audio; acquiring the first spectral peak point of the first matching spectrum segment, dividing the first matching spectrum segment into a first front spectrum segment and a first rear spectrum segment based on the first spectral peak point, acquiring the second spectral peak point of the second matching spectrum segment, and dividing the second matching spectrum segment into a second front spectrum segment and a second rear spectrum segment based on the second spectral peak point; obtaining a front spectrum difference based on the first and second front spectrum segments, obtaining a rear spectrum difference based on the first and second rear spectrum segments, and determining whether the front spectrum difference and the rear spectrum difference are both less than a preset threshold; and, if they are not both less than the preset threshold, adaptively matching the duration of the second front spectrum segment based on the front spectrum difference to generate a target front spectrum segment, adaptively matching the duration of the second rear spectrum segment based on the rear spectrum difference to generate a target rear spectrum segment, constructing the target sentence audio corresponding to the song phrase from the target front spectrum segment and the target rear spectrum segment, and obtaining the matched audio based on the target sentence audio corresponding to the multiple song phrases.

[0006] Optionally, obtaining the first matching spectrum segment corresponding to each song phrase based on the original audio includes: detecting the fundamental frequency region in the first matching spectrum segment corresponding to the i-th song phrase; extending the boundary before and after the fundamental frequency region corresponding to the i-th song phrase until it extends to the spectral transition point, and taking the spectrum segment between the spectral transition points before and after the fundamental frequency region as the first matching spectrum segment corresponding to the i-th song phrase.

[0007] Optionally, obtaining the first spectral peak point of the first matched spectrum segment includes: obtaining the fundamental frequency region in the first matched spectrum segment, obtaining the point of maximum vibration energy in the fundamental frequency region, and taking the point of maximum vibration energy as the first spectral peak point.

[0008] Optionally, obtaining the front spectrum difference based on the first front spectrum segment and the second front spectrum segment includes: obtaining the duration of the first front spectrum segment and obtaining the duration of the second front spectrum segment; and using the absolute value of the difference between the duration of the first front spectrum segment and the duration of the second front spectrum segment as the front spectrum difference.

[0009] Optionally, adaptively matching the duration of the second front spectrum segment based on the front spectrum difference and generating the target front spectrum segment includes: if the duration of the first front spectrum segment exceeds the duration of the second front spectrum segment, then obtaining a front reduction factor based on the front spectrum difference, obtaining a front matching duration based on the front reduction factor and the duration of the second front spectrum segment, and compressing the second front spectrum segment according to the front matching duration to generate a target front spectrum segment with a duration equal to the front matching duration; if the duration of the first front spectrum segment does not exceed the duration of the second front spectrum segment, then obtaining a front amplification factor based on the front spectrum difference, obtaining a front matching duration based on the front amplification factor and the duration of the second front spectrum segment, and amplifying the second front spectrum segment according to the front matching duration to generate a target front spectrum segment with a duration equal to the front matching duration.

[0010] Optionally, adaptively matching the duration of the second post-spectrum segment based on the post-spectrum difference and generating the target post-spectrum segment includes: if the duration of the first post-spectrum segment exceeds the duration of the second post-spectrum segment, then obtaining a post-shortening coefficient based on the post-spectrum difference, obtaining a post-matching duration based on the post-shortening coefficient and the duration of the second post-spectrum segment, and compressing the second post-spectrum segment according to the post-matching duration to generate a target post-spectrum segment with a duration equal to the post-matching duration; if the duration of the first post-spectrum segment does not exceed the duration of the second post-spectrum segment, then obtaining a post-amplification coefficient based on the post-spectrum difference, obtaining a post-matching duration based on the post-amplification coefficient and the duration of the second post-spectrum segment, and amplifying the second post-spectrum segment according to the post-matching duration to generate a target post-spectrum segment with a duration equal to the post-matching duration.

[0011] A song timbre matching system based on audio spectrum analysis is also provided. The system includes: a data acquisition module, used to acquire the original vocal audio and user audio of a target song, acquire multiple song phrases corresponding to the target song, acquire a first matching spectrum segment corresponding to each song phrase based on the original vocal audio, and acquire a second matching spectrum segment corresponding to each song phrase based on the user audio; a data processing module, used to acquire the first spectral peak point of the first matching spectrum segment, divide the first matching spectrum segment into a first front spectrum segment and a first rear spectrum segment based on the first spectral peak point, acquire the second spectral peak point of the second matching spectrum segment, and divide the second matching spectrum segment into a second front spectrum segment and a second rear spectrum segment based on the second spectral peak point; a data analysis module, used to obtain the front spectrum difference based on the first and second front spectrum segments, obtain the rear spectrum difference based on the first and second rear spectrum segments, and determine whether the front spectrum difference and the rear spectrum difference are both less than a preset threshold; and a matching generation module, used, if the front spectrum difference and the rear spectrum difference are not both less than the preset threshold, to adaptively match the duration of the second front spectrum segment based on the front spectrum difference and generate a target front spectrum segment, adaptively match the duration of the second rear spectrum segment based on the rear spectrum difference and generate a target rear spectrum segment, construct the target sentence audio corresponding to the song phrase based on the target front spectrum segment and the target rear spectrum segment, and obtain the matched audio based on the target sentence audio corresponding to the multiple song phrases.

[0012] Optionally, the data acquisition module is further configured to: detect the fundamental frequency region in the first matching spectrum segment corresponding to the i-th song phrase; extend the boundary before and after the fundamental frequency region corresponding to the i-th song phrase until it extends to the spectrum transition point, and take the spectrum segment between the spectrum transition points before and after the fundamental frequency region as the first matching spectrum segment corresponding to the i-th song phrase.

[0013] Optionally, the data analysis module is further configured to: obtain the duration of the first front spectrum segment and the duration of the second front spectrum segment; and use the absolute value of the difference between the duration of the first front spectrum segment and the duration of the second front spectrum segment as the front spectrum difference.

[0014] Optionally, the matching generation module is further configured to: if the duration of the first front spectrum segment exceeds the duration of the second front spectrum segment, obtain a front reduction factor based on the front spectrum difference, obtain a front matching duration based on the front reduction factor and the duration of the second front spectrum segment, and compress the second front spectrum segment according to the front matching duration to generate a target front spectrum segment with a duration equal to the front matching duration; if the duration of the first front spectrum segment does not exceed the duration of the second front spectrum segment, obtain a front amplification factor based on the front spectrum difference, obtain a front matching duration based on the front amplification factor and the duration of the second front spectrum segment, and amplify the second front spectrum segment according to the front matching duration to generate a target front spectrum segment with a duration equal to the front matching duration.

[0015] The beneficial effects of this invention are reflected in:

[0016] In the song timbre matching method based on audio spectrum analysis, firstly, a matching spectrum segment covering the complete articulation cycle is dynamically defined. Even if the singer rushes or drags the beat, causing timing deviations, a precise matching unit can still be established by aligning the acoustic events of the onset exhalation phase and the end-note decay phase, thus mitigating spectrum segment misalignment caused by rhythm deviations to a certain extent. Furthermore, using the point of maximum vibration energy in the fundamental frequency region as an anchor point, the spectrum segment is divided into a front segment and a rear segment, forming independent correction units with physiological isomorphism. Finally, the timing anomalies in the front and rear segments are quantified and combined with preset thresholds to achieve dynamic decision-making: correction is triggered only when the duration difference of a unit exceeds the tolerance, preserving the original timbre dynamics of any unit that stays within the tolerance, and avoiding the fluctuations and breathiness distortion introduced by scaling the whole phrase in existing solutions. Furthermore, the first-stage correction ends at the peak point to protect the impact intensity of the onset, and the second-stage correction starts at the peak point to maintain the continuity of the formant. The first-stage compression / amplification only changes the duration of the initial consonant burst without diluting the peak energy of the spectrum, and the second-stage duration adjustment avoids inaccuracy in the fundamental frequency oscillation frequency by proportionally shrinking the amplitude envelope, so that the impact phase does not collapse and the attenuation trajectory does not break.

Attached Figure Description

[0017] To more clearly illustrate the specific embodiments of the present invention or the technical solutions in the prior art, the accompanying drawings used in the description of the specific embodiments or the prior art will be briefly introduced below. In all the drawings, similar elements or parts are generally identified by similar reference numerals. In the drawings, the elements or parts are not necessarily drawn to scale.

[0018] Figure 1 is a partial flowchart of the song timbre matching method based on audio spectrum analysis of the present invention;

[0019] Figure 2 is a schematic diagram of another part of the song timbre matching method based on audio spectrum analysis of the present invention;

[0020] Figure 3 is a schematic diagram of another part of the song timbre matching method based on audio spectrum analysis of the present invention;

[0021] Figure 4 is a schematic diagram illustrating the steps of the song timbre matching method based on audio spectrum analysis of the present invention;

[0022] Figure 5 is a schematic diagram of part of step S1 in the song timbre matching method based on audio spectrum analysis of the present invention;

[0023] Figure 6 is a schematic diagram of part of step S3 in the song timbre matching method based on audio spectrum analysis of the present invention.

Detailed Implementation

[0024] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. The components of the embodiments of the present invention described and shown in the accompanying drawings can generally be arranged and designed in various different configurations.

[0025] Therefore, the following detailed description of the embodiments of the invention provided in the accompanying drawings is not intended to limit the scope of the claimed invention, but merely to illustrate selected embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort are within the scope of protection of the invention.

[0026] It should be noted that similar reference numerals and letters in the following figures indicate similar items; therefore, once an item is defined in one figure, it does not need to be further defined and explained in subsequent figures. Furthermore, the terms "first," "second," etc., are used only to distinguish descriptions and should not be construed as indicating or implying relative importance.

[0027] As shown in Figure 1, Figure 2, Figure 3 and Figure 4, a song timbre matching method based on audio spectrum analysis is provided. In one embodiment, the method includes:

[0028] S1. Obtain the original audio and user audio of the target song, and obtain multiple song phrases corresponding to the target song. Obtain the first matching spectrum segment corresponding to each song phrase based on the original audio, and obtain the second matching spectrum segment corresponding to each song phrase based on the user audio.

[0029] S2. Obtain the first spectral peak point of the first matching spectrum segment, and divide the first matching spectrum segment into a first front spectrum segment and a first rear spectrum segment according to the first spectral peak point. Obtain the second spectral peak point of the second matching spectrum segment, and divide the second matching spectrum segment into a second front spectrum segment and a second rear spectrum segment according to the second spectral peak point.

[0030] S3. Obtain the front spectrum difference based on the first front spectrum segment and the second front spectrum segment (in practice, the difference between the durations of the two front spectrum segments), obtain the rear spectrum difference based on the first rear spectrum segment and the second rear spectrum segment, and determine whether the front spectrum difference and the rear spectrum difference are both less than the preset threshold.

[0031] S4. If the front spectrum difference and the rear spectrum difference are not both less than the preset threshold, adaptively match the duration of the second front spectrum segment according to the front spectrum difference to generate a target front spectrum segment, and adaptively match the duration of the second rear spectrum segment according to the rear spectrum difference to generate a target rear spectrum segment. Construct the target sentence audio corresponding to the song phrase from the target front spectrum segment and the target rear spectrum segment, and obtain the matched audio from the target sentence audio corresponding to the multiple song phrases.

[0032] In this embodiment, it should be noted that in S1, the physiological acoustic feature segments of the original audio and the user's singing are extracted. This method first delineates timing units based on the song's natural phrasing structure (such as lyrics or melodic breathing points) to ensure that each processing unit corresponds to a complete articulation cycle. The key innovation lies in not directly extracting phrase time segments, but dynamically locating acoustic boundaries by detecting spectral jump events: for example, for a lyric "PLMN", the fundamental frequency stable region of that phrase in the original audio is first identified (e.g., the continuous vibration segment of the vowel / iŋ / in the "P" syllable), then traced back to the glottal impact starting point (e.g., the explosive energy surge of the initial consonant / m / in the "L" syllable), and extended backward to the breath decay endpoint (e.g., the final flutter before the amplitude of the final sound / oʊ / in the "M" syllable reaches zero), thereby defining the first matching spectral segment that completely covers the articulation process. The same operation is performed synchronously on the user's audio; even if there are deviations in the singing rhythm (e.g., premature start due to rushing the beat), alignment units can be established through acoustic event comparison rather than fixed timestamps.

[0033] Furthermore, this process effectively addresses the shortcomings of traditional methods that neglect dynamic articulation characteristics. For example, when a user sings the chorus "EFGHIJK" of the song "ABCD," if the "E" note ends prematurely, traditional global stretching would compress the original singer's 3-second decaying vibrato to 2 seconds, resulting in spectral energy collapse and distortion. This solution, however, accurately identifies the short, abrupt ending note segment of the actual singing as the second matching spectral segment by detecting the fundamental frequency region of the "G" note in the user's audio (e.g., core vowels, such as / ɪŋ / ) and the decay jump point (the spectral break where vocal cord vibration stops). Simultaneously, it locates the corresponding complete decay trajectory segment of the original singer as a benchmark. By focusing on physiological articulation units rather than fixed durations, it preserves the user's unique onset explosiveness (such as the transient characteristics of the alveolar fricative "E") and provides precise acoustic anchors for subsequent prosodic correction, avoiding disruption of the energy transition logic between syllables.
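The boundary search described for S1 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `voiced` (per-frame flags marking a detected fundamental frequency), `flux` (a per-frame measure of spectral change) and the `jump` threshold are hypothetical inputs that the text does not define, and the longest voiced run is simply assumed to be the fundamental frequency region.

```python
from typing import List, Tuple


def extend_to_transitions(voiced: List[bool], flux: List[float], jump: float) -> Tuple[int, int]:
    """Return (start, end) frame indices of one matching spectrum segment.

    Assumption: the longest run of voiced frames is the fundamental
    frequency region, and a frame with flux >= jump is a spectral
    transition point. Both rules are illustrative stand-ins.
    """
    best_start, best_len, run_start = -1, 0, None
    # The appended False acts as a sentinel that closes a trailing voiced run.
    for i, is_voiced in enumerate(voiced + [False]):
        if is_voiced and run_start is None:
            run_start = i
        elif not is_voiced and run_start is not None:
            if i - run_start > best_len:
                best_start, best_len = run_start, i - run_start
            run_start = None
    if best_len == 0:
        raise ValueError("no voiced frames found")

    start, end = best_start, best_start + best_len - 1
    # Walk outward from the region until a transition frame (or the edge) is hit.
    while start > 0 and flux[start] < jump:
        start -= 1
    while end < len(flux) - 1 and flux[end] < jump:
        end += 1
    return start, end


voiced = [False, False, False, True, True, True, True, False, False, False]
flux = [0.1, 0.9, 0.2, 0.1, 0.1, 0.1, 0.1, 0.2, 0.8, 0.1]
print(extend_to_transitions(voiced, flux, jump=0.5))  # (1, 8)
```

Because the boundaries come from acoustic events rather than timestamps, running the same function on the original vocal and on the user audio yields comparable units even when the user starts early or ends late.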

[0034] In S2, the dynamic unit of vocalization is segmented through an acoustic energy hub to achieve spatiotemporal decoupling of the physiological structure. First, the first matched spectrum segment (original vocals) and the second matched spectrum segment (user vocals) obtained in S1 are extracted. In each spectrum segment, the global highest point of vibration energy in the fundamental frequency range (i.e., the peak point of the spectrum) is located. This point is essentially the acoustic core anchor point of articulation, representing the resonant state with the most stable breath support and the most complete vocal cord closure (e.g., the peak intensity of the vowel core vibration of "L" in the phrase "KLM" in a song). Subsequently, using this peak point as a precise cutting point, the original vocal spectrum segment is divided into the first pre-spectral segment (covering the onset impact stage from the burst of the initial consonant "K" to the peak energy of the vowel "L") and the first post-spectral segment (continuing the vibrato trajectory from the peak of "L" to the decay of the final sound of "M"). The system performs the same segmentation operation on the corresponding phrases of the user's singing, and even if there is a deviation in the singing rhythm (such as rushing the beat and causing the "K" to start singing earlier), it can still complete the segmentation of the front and back segments based on the objective physical marker of energy peak, ensuring that the second front / back spectrum segment of the user's audio maintains the topological isomorphism of the physiological pronunciation stage with the original singer.

[0035] Furthermore, S2 avoids the phase distortion problem of traditional global stretching to a certain extent. Take the song refrain "PQR" as an example: when the user sings the last note "R", the vibrato is prolonged because the user drags the beat. Although the energy peak point of the second matching spectrum segment (corresponding to the core of the "Q" vowel) is consistent with the original singer's position, the decay time of the rear segment is much longer than that of the original singer. At this stage only segmentation is performed, without any processing: the first rear spectrum segment keeps the original singer's 1.5-second natural decay trajectory, and the second rear spectrum segment keeps the user's actual 2-second dragged-beat data. Through the rigid positioning of the peak point, the glottal pulse of the front segment (the instantaneous spectral transition of the "P" plosive) and the vibrato envelope of the rear segment (the fundamental frequency oscillation decay of the "R") are decoupled into independent correction units. This prevents the destruction of the transient impact characteristics of "P" that occurs when traditional methods compress the whole phrase (such as the loss of the initial consonant fricative due to compression of the front segment) and the distortion of the vibrato oscillation logic of "R" (such as the doubling of the oscillation frequency due to stretching of the rear segment). Thus, the physiological continuity of the timbre dynamics is completely preserved before time-domain correction.

[0036] Further, let's illustrate with an example, assuming the song is segmented as "KLM". "K": represents the starting word (e.g., a plosive initial consonant), "L": represents the core word (the steady-state vibration segment of the vowel), "M": represents the ending word (the attenuation segment of the final note), and the spectral peak point is located at the highest energy peak in the fundamental frequency region of "L". The preceding spectral segment is from the starting point of "K" to the peak point of "L" (covering the initial consonant impact); the following spectral segment is from the peak point of "L" to the ending point of "M" (covering the attenuation of the final note). This segmentation mechanism completely separates the instantaneous plosive characteristics of "K" from the continuous oscillation characteristics of "M", laying the structural foundation for the subsequent independent duration correction of S3-S4.
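The "KLM" split can be illustrated with a short sketch. It assumes the segment is already available as a list of per-frame vibration energies in the fundamental frequency region and that frames are evenly spaced; the names `energy` and `frame_s` are hypothetical and do not appear in the patent text.

```python
from typing import List, Tuple


def split_at_peak(energy: List[float], frame_s: float) -> Tuple[int, float, float]:
    """Locate the spectral peak point and return (peak_index, front_s, rear_s).

    energy: per-frame vibration energy of one matching spectrum segment
    frame_s: assumed spacing between consecutive frames, in seconds
    """
    if not energy:
        raise ValueError("empty segment")
    # Global maximum of the energy curve; ties resolve to the earliest frame.
    peak = max(range(len(energy)), key=energy.__getitem__)
    front_s = peak * frame_s                     # segment start -> peak ("K" to "L" peak)
    rear_s = (len(energy) - 1 - peak) * frame_s  # peak -> segment end ("L" peak to "M")
    return peak, front_s, rear_s


peak, front_s, rear_s = split_at_peak([0.1, 0.4, 0.9, 0.7, 0.5, 0.3, 0.1], frame_s=0.01)
print(peak, front_s, rear_s)
```

The peak frame belongs to both halves as their shared boundary, which matches the description of the front segment ending at the peak and the rear segment starting there.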

[0037] In S3, timbre preservation and rhythm correction are performed by quantifying the temporal differences of the physiological articulation units. A refined temporal comparison is conducted on the acoustic units decoupled in S2 (the first segment represents the onset impact phase, and the second segment represents the tail decay phase). The dynamics of the original singer's and the user's initial articulation are compared to obtain the duration fluctuations of the first pre-spectral segment (the original singer's onset impact phase, such as the interval from the "K" initial consonant to the "L" peak in the lyrics "KLM") and the second pre-spectral segment (the user's corresponding interval), generating a pre-spectral difference. This difference reflects deviations in the user's articulation speed (such as compression of the initial segment due to rushing). Simultaneously, the tail decay trajectory is compared to obtain the duration offsets of the first post-spectral segment (the original singer's tail note segment, such as the interval from the "L" peak to the "M" tail decay) and the second post-spectral segment (the user's tail note segment), generating a post-spectral difference. This value captures anomalies in the user's tail note control (such as prolongation of the post-note segment due to dragging the beat).

[0038] Furthermore, the two differences are compared with preset thresholds. If both are less than the threshold (e.g., the user's onset rate and end note decay are very close to the original singer's), the correction process is skipped, and the user's original vocal characteristics are directly preserved to avoid over-processing and introducing distortion. If either difference exceeds the threshold, S4's adaptive correction is triggered (e.g., the latter part is significantly longer due to dragging), but only the excessive units are processed, not the entire sentence. The preset thresholds can be trained using an audio feature library and calibrated using acoustic experiments. Then, combined with S4, the threshold updates are fed back to form a closed-loop adjustment mechanism.

[0039] Furthermore, take the song phrase "PQR" as an example. Scenario 1: The user sings the "P" note prematurely (the front segment is 20% shorter than the original), but the ending "R" note is handled accurately (the back segment is close to the original). In this case, the front spectrum difference exceeds the threshold, while the back spectrum difference does not. Only the front segment (the initial consonant impact phase of the "P" note) is corrected, by extending it toward the original duration, while the back segment (the attenuation of the vibrato of the "R" note) remains unchanged, preserving the natural ending note. Scenario 2: The user drags out the "Q" note, causing the peak point to shift backward, thus prolonging the back segment while the front segment remains normal. Only the back segment (the ending "R" note) is compressed, while the front segment (from the plosive of the "P" note to the peak point of the "Q" note) remains unaffected, preventing the correction from contaminating transient acoustic events (such as the transient characteristics of the alveolar fricative of the "P" note).

[0040] Furthermore, even if the total duration deviation of the entire phrase is small (e.g., the user's total time difference is only 3%), if there is a significant deviation in the latter part (e.g., dragging the beat causes the ending note to be extended by 15%), the problematic unit is still accurately identified; when the user's onset impact time is slightly different from the original (e.g., the difference in the first part is below the threshold), the original timbre dynamics are maintained (e.g., the unique hard onset style of "K" is preserved) to avoid destroying the singing personality.

[0041] In S4, rhythm deviation repair and dynamic timbre preservation are synergistically optimized through physiological unit separation correction. S4 is only triggered when S3 determines that correction is needed (i.e., the difference between the front and back spectra exceeds the threshold). Its innovation lies in the independent processing of the onset impact segment (front segment) and the tail decay segment (back segment) in the time domain.

[0042] In the front-segment correction, to address articulation rate deviations (such as rushing the beat, which makes the front segment too short), the difference in front-segment duration between the original singer and the user (the front spectrum difference) is dynamically sensed. Using the spectrum peak point as the acoustic anchor point, only the user's front segment (from the initial consonant burst to the energy peak) is amplified or compressed in time. If the user starts singing too quickly (front-segment duration shorter than the original singer's), the front spectrum segment is amplified proportionally to restore the glottal pulse intensity (such as the transient energy of the alveolar fricative in the word "K" in the lyrics "KLM"); if the user starts singing too slowly, the segment is compressed instead to avoid dragging and distortion.

[0043] In the back-segment correction, for abnormal tail note durations (such as an excessively long back segment due to dragging the beat), the same difference sensing mechanism is used to adjust the user's back segment (from the peak point to the end of the tail note decay). When the tail note drags the beat, its vibrato decay segment is compressed, but the continuity of the formant trajectory is maintained (such as the fundamental frequency oscillation envelope of the "R" letter); when the tail note is too short, the decay is extended to prevent energy collapse.

[0044] Furthermore, taking the song phrase "PQR" as an example. Scenario 1 (correction of the first part only): The user's singing rushes the beat, causing compression of the first part of the "P". S4 only amplifies the duration of the first part (from the plosive "P" to the peak of "Q"), restoring the impact of the plosive; the ending "R" remains unchanged because its duration does not exceed the limit, and its natural vibrato frequency (such as 5Hz physiological oscillation) is completely undisturbed. Scenario 2 (correction of the last part only): The user drags the beat of the "R", causing the ending sound to be prolonged. Only the spectrum of the last part is compressed (from the peak of "Q" to the end of the decay of "R"), precisely compressing the vibrato duration while maintaining its oscillation frequency unchanged; the onset characteristics of the first part from "P" to the peak of "Q" (such as the instantaneous closure noise of the vocal cords) are preserved without intervention. In summary, all corrections use the peak point of the spectrum as a rigid benchmark (such as the vowel resonance core of "Q"). The first correction ends at this point, and the second correction begins at this point, ensuring that the trajectory of the core resonance peak is not distorted. The first correction focuses on the transient characteristics of the onset impact (such as the 3-millisecond plosive pulse of "P"), while the second correction locks in the gradual characteristics of the tail decay (such as the gradually changing curve of the vibrato amplitude of "R"). The physical parameters of the two do not interfere with each other. When the duration of a certain segment does not exceed the limit (such as the latter part of scenario 1), the user's original vocal cord vibration pattern (such as hoarse timbre or clear sound quality) is completely preserved, avoiding excessive mechanical processing.
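As a rough illustration of this segment-wise logic, the sketch below picks a target duration for the front and back segments independently. It is not taken from the patent: the function name `target_durations`, the millisecond units, and the rule that a corrected segment simply takes the original singer's duration are assumptions made for the example (the text only states that a matching duration is derived from a coefficient).

```python
from typing import Tuple


def target_durations(original: Tuple[int, int], user: Tuple[int, int],
                     threshold_ms: int) -> Tuple[int, int]:
    """Return (front_ms, back_ms) target durations for the user's phrase.

    original and user are (front_ms, back_ms) pairs measured on either side
    of the spectrum peak point. Assumption: a segment whose absolute
    deviation reaches the threshold takes the original duration; a segment
    within tolerance keeps the user's own duration untouched.
    """
    def pick(orig_ms: int, user_ms: int) -> int:
        deviates = abs(orig_ms - user_ms) >= threshold_ms
        return orig_ms if deviates else user_ms

    return pick(original[0], user[0]), pick(original[1], user[1])


# Scenario 1: rushed front segment, accurate ending -> only the front changes.
print(target_durations((400, 900), (250, 880), 100))
# Scenario 2: normal front segment, dragged ending -> only the back changes.
print(target_durations((400, 900), (410, 1100), 100))
```

Because both segments meet at the peak point, changing one target duration leaves the other segment's frames untouched, which is the property the two scenarios rely on.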

[0045] In summary, the song timbre matching method based on audio spectrum analysis first dynamically defines a matching spectrum segment covering the complete articulation cycle. Even if the singer rushes or drags the beat, causing timing shifts, a precise matching unit can still be established by aligning the acoustic events of the onset exhalation phase and the end-note decay phase, thus mitigating spectrum segment misalignment caused by rhythm deviations to a certain extent. Furthermore, using the point of maximum vibration energy in the fundamental frequency region as an anchor point, the spectrum segment is divided into a front and a back segment, forming independent correction units with physiological isomorphism. Finally, by quantifying the time value differences between the front and back segments and combining them with preset thresholds, dynamic decision-making is achieved. Correction is triggered only when the duration deviation of any unit exceeds the tolerance, preserving the original timbre dynamics within the acceptable range and avoiding the fluctuations and breathiness distortion introduced by scaling entire phrases in existing solutions. Furthermore, the front correction terminates at the peak point to protect the impact intensity of the onset, while the back correction begins at the peak point to maintain the continuity of the formant. The front compression / amplification only changes the duration of the initial consonant burst without diluting the peak energy of the spectrum, and the back duration adjustment avoids fundamental frequency oscillation inaccuracies by proportionally scaling the amplitude envelope, so that the impact phase does not collapse and the attenuation trajectory does not break.
In summary, without any modeling, the method not only repairs rhythmic deviations (physically correcting a front segment shortened by rushing the beat and a back segment lengthened by dragging the beat) but also completely preserves the original singer's style imprint (onset bursts, vibrato at the end of the note) and the user's timbre personality (zero intervention in vocal cord vibration mode).

[0046] As shown in Figure 5, in one embodiment, obtaining the first matching spectrum segment corresponding to each song phrase based on the original audio in S1 includes:

[0047] S11. Detect the fundamental frequency region in the first matching spectrum segment corresponding to the i-th song phrase;

[0048] S12. Extend the boundary before and after the fundamental frequency region corresponding to the i-th song phrase until it reaches the spectral transition point, and take the spectral segment between the spectral transition points before and after the fundamental frequency region as the first matching spectral segment corresponding to the i-th song phrase.

[0049] In this embodiment, it should be noted that in S11, the fundamental frequency region (acoustic steady-state region) is located by detecting fundamental frequency stability. In the processing of the song phrase "KLM", the spectral segment of the corresponding phrase in the original audio is first scanned, and the steady-state vibration region is identified based on the fundamental frequency fluctuation threshold. This region must meet the core conditions: the fundamental frequency vibration trajectory is stable (usually excluding the transition zone of the onset plosive and the tail decay), and the energy fluctuation amplitude is lower than the set tolerance value (reflecting the continuity of breath support). For example, in the vowel pronunciation segment corresponding to the letter "L", the segment with fundamental frequency components (such as the derivative change rate of fundamental frequency F0 being continuously lower than the threshold) and stable harmonic structure in its spectrum is detected. This is the steady-state core region of physiological pronunciation (equivalent to the most regular resonance stage of vocal cord vibration). This operation excludes the vibrato modification zone (such as the tail oscillation of "M") or transient interference (such as the initial consonant burst of "K"), providing acoustic anchor points for subsequent boundary expansion.

[0050] In S12, a complete vocal unit is dynamically constructed based on acoustic events, extending from the fundamental frequency region located in S11 (such as the stable vowel segment of "L") to both ends of the time sequence, and the physiological boundary is determined by tracking the spectral energy mutation events.

[0051] Furthermore, by scanning backward from the starting point of the fundamental frequency region, the abrupt change point where the spectral energy gradient jumps from zero to sustained vibration is captured (corresponding to the physical characteristics of the initial glottal burst). For example, at the beginning of the "K" sound, when the spectral energy jumps from the ambient noise level (-80 dB) to a stable amplitude (-20 dB) within 3 milliseconds, this jump point is marked as the acoustic starting boundary (completely covering the release phase of the initial consonant /k/).

[0052] Furthermore, by scanning forward from the end of the fundamental frequency region, the point where the spectral energy collapses from steady-state vibration to respiratory noise (corresponding to the physiological signal of vocal cord vibration cessation) is captured. For example, at the end of the "M" sound, when the fundamental frequency energy envelope amplitude decays to less than 10% of the peak value and respiratory noise (a sudden drop in high-frequency energy) appears, this transition point is marked as the termination boundary.

[0053] Ultimately, the matched spectrum segment (starting transition point → ending transition point) fully preserves the dynamic characteristics of the articulation process: it covers the transient explosive energy impact of "K", the steady-state resonance peak trajectory of "L", and extends to the gradually weakening vibrato oscillation of "M", achieving full-cycle coverage driven by acoustic events.
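The boundary search of S11 and S12 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of this disclosure: the phrase is assumed to be already reduced to per-frame F0 values (Hz, with 0 meaning unvoiced) and per-frame energies (dB), and the names `find_steady_region`, `extend_to_transitions`, `f0_tol` and `floor_db` are hypothetical. The transition points are approximated by a single noise-floor test rather than the gradient and 10%-of-peak criteria described above.

```python
from typing import List, Tuple


def find_steady_region(f0: List[float], f0_tol: float) -> Tuple[int, int]:
    """Return inclusive (start, end) frame indices of the longest voiced run
    whose frame-to-frame F0 change stays below f0_tol.
    Returns (0, -1) when no voiced frame exists."""
    best = (0, -1)
    run_start = None
    for i, value in enumerate(f0):
        if value <= 0:                      # unvoiced frame ends any run
            run_start = None
            continue
        if run_start is None or abs(value - f0[i - 1]) >= f0_tol:
            run_start = i                   # stability broken: start a new run
        if i - run_start > best[1] - best[0]:
            best = (run_start, i)
    return best


def extend_to_transitions(energy_db: List[float], region: Tuple[int, int],
                          floor_db: float) -> Tuple[int, int]:
    """Widen region outward while the neighbouring frame is above floor_db."""
    start, end = region
    while start > 0 and energy_db[start - 1] > floor_db:
        start -= 1
    while end < len(energy_db) - 1 and energy_db[end + 1] > floor_db:
        end += 1
    return start, end


f0 = [0, 0, 200, 201, 200, 201, 150, 0]
energy_db = [-80, -30, -20, -20, -20, -20, -35, -80]
steady = find_steady_region(f0, f0_tol=5.0)
print(steady, extend_to_transitions(energy_db, steady, floor_db=-60))
```

In this toy input the steady region covers the stable vowel frames only, and the extension step then pulls in the onset frame before it and the decay frame after it.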

[0054] It should also be noted that the method for obtaining the second matching spectrum segment corresponding to each song phrase can be the same as the method for obtaining the first matching spectrum segment corresponding to each song phrase.

[0055] In one implementation, obtaining the first spectral peak point of the first matched spectral segment in S2 includes:

[0056] Obtain the fundamental frequency region in the first matched spectrum segment, and obtain the point of maximum vibration energy in the fundamental frequency region, and take the point of maximum vibration energy as the first spectrum peak point.

[0057] In this embodiment, it should be noted that, firstly, within the first matching spectrum segment extracted by S1 (such as the spectrum segment covering the complete articulation cycle in the song phrase "PQR"), the fundamental frequency region is precisely locked. This region must satisfy the condition that the fundamental frequency trajectory is stable (excluding transitional areas such as onset plosives and tail decay). Its stability is reflected in the fact that the rate of change of the derivative of vocal cord vibration frequency is continuously lower than the preset tolerance threshold, representing the physiological homeostatic core with the most sufficient breath support and the most stable harmonic structure (for example, the vowel pronunciation segment corresponding to the letter "Q", whose fundamental frequency fluctuation rate is lower than the threshold to form a continuous resonance platform). Subsequently, the energy distribution of the entire frequency band is scanned within this fundamental frequency region to identify the point of maximum vibration energy. This point is the global peak of the spectral amplitude, which is essentially the peak of resonance intensity with the most complete vocal cord closure and the strongest airflow drive (such as the instantaneous energy burst point at the moment of glottal closure in the core of the "Q" vowel, whose amplitude is significantly higher than the harmonic energy at adjacent moments). Finally, it is marked as the spectral peak point, serving as a rigid acoustic anchor point for the segmentation of the pre / post-segment spectrum. This choice ensures that the peak point simultaneously carries the stability of physiological pronunciation (fundamental frequency region screening to exclude vibrato disturbances) and energy dominance (maximum point captures the core of resonance intensity), providing an objective physical benchmark for subsequent segmentation. 
For example, in the phrase "PQR", the vowel energy peak of "Q" is used as the boundary to completely decouple the plosive transient of the initial consonant "P" from the oscillation decay of the final consonant "R" in the time domain.
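The peak selection and segmentation described here can be illustrated with a short sketch. It assumes per-frame energy values and frame-index ranges as inputs; the function name `split_at_energy_peak` is hypothetical, and letting the peak frame belong to both the front and the back segment is a choice made for the example, not something the text specifies.

```python
from typing import List, Tuple

Range = Tuple[int, int]


def split_at_energy_peak(energy: List[float], f0_region: Range,
                         segment: Range) -> Tuple[int, Range, Range]:
    """Return (peak, front, back).

    peak is the frame with maximum energy inside f0_region (inclusive);
    front runs from the segment start to the peak, back from the peak to
    the segment end.
    """
    lo, hi = f0_region
    peak = max(range(lo, hi + 1), key=lambda i: energy[i])
    seg_start, seg_end = segment
    return peak, (seg_start, peak), (peak, seg_end)


energy = [0.0, 0.2, 0.5, 0.9, 0.7, 0.6, 0.3, 0.0]
print(split_at_energy_peak(energy, f0_region=(2, 5), segment=(1, 6)))
```

Restricting the search to the fundamental frequency region means a louder transient outside that region, such as an onset burst, cannot be picked as the anchor.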

[0058] It should also be noted that the method for obtaining the second spectral peak point of the second matching spectrum segment can be the same as the method for obtaining the first spectral peak point of the first matching spectrum segment.

[0059] As shown in Figure 6, in one embodiment, obtaining the front spectrum difference based on the first front spectrum segment and the second front spectrum segment in S3 includes:

[0060] S31. Obtain the duration of the first front spectrum segment and the duration of the second front spectrum segment;

[0061] S32. Use the absolute value of the difference between the duration of the first front spectrum segment and the duration of the second front spectrum segment as the front spectrum difference.

[0062] In this embodiment, it should be noted that in S31, the temporal differences of the physiological articulation units are quantified to accurately measure the difference in articulation duration between the original singer and the user during the onset impact phase. A high-precision duration measurement is performed on the first front spectrum segment (the onset impact phase from the original singer's "P" initial consonant to the peak of "Q") in the song phrase "PQR". This duration reflects the original singer's standard vocal rhythm (such as the complete process of vocal cord closure, airflow buildup, and explosive release). Simultaneously, the duration of the second front spectrum segment (the actual onset segment from the user's "P" to the peak of "Q") in the corresponding phrase is measured. The objects compared are not the entire phrase audio but the independent acoustic units decoupled in S2, ensuring that the detection results purely reflect the physiological deviation in articulation speed. This duration difference is directly related to the speed of glottal muscle movement (e.g., if the user rushes the beat, the plosive process of "P" will be compressed, shortening the vocal cord closure period), serving as the physical basis for correcting rhythmic deviations.

[0063] In S32, the absolute difference between the two durations (the front spectrum difference) is calculated so that the direction of the deviation does not interfere with the decision; only the physical magnitude of the deviation matters. Whether the front segment in a user's rendition of "PQR" is shorter (rushed) or longer (dragged) than in the original recording, correction is required as long as the absolute difference exceeds the threshold. For example, if the user shortens the front segment by rushing, the percentage of shortening is not considered; only whether the absolute amount of shortening reaches the distortion threshold is judged (e.g., if the threshold is set to 100 milliseconds, a shortening of 90 milliseconds is ignored, while a shortening of 150 milliseconds triggers the S4 correction).
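The 100-millisecond example can be reproduced with a few lines of code. The function names `spectrum_difference` and `needs_correction` and the use of integer milliseconds are assumptions for illustration. Following the claim wording, correction is skipped only when the difference is less than the threshold, so a difference exactly equal to the threshold is treated here as triggering correction.

```python
def spectrum_difference(original_ms: int, user_ms: int) -> int:
    """Absolute duration difference in milliseconds; the sign is discarded."""
    return abs(original_ms - user_ms)


def needs_correction(difference_ms: int, threshold_ms: int) -> bool:
    """True when the difference is not below the preset threshold."""
    return difference_ms >= threshold_ms


THRESHOLD_MS = 100  # example value from the text
original_front_ms = 400  # hypothetical duration of the original front segment

for user_front_ms in (310, 250, 550):  # rushed by 90, rushed by 150, dragged by 150
    diff = spectrum_difference(original_front_ms, user_front_ms)
    print(user_front_ms, diff, needs_correction(diff, THRESHOLD_MS))
```

The rushed-by-150 and dragged-by-150 cases give the same difference, which is the point of taking the absolute value: the direction is only needed later, when S4 chooses between compression and amplification.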

[0064] It should also be noted that the method for obtaining the back spectrum difference can be the same as the method for obtaining the front spectrum difference.

[0065] In one implementation, adaptively matching the duration of the second front spectrum segment based on the front spectrum difference and generating the target front spectrum segment in S4 includes:

[0066] S41. If the duration of the first front spectrum segment exceeds the duration of the second front spectrum segment, then obtain the front reduction coefficient based on the front spectrum difference, obtain the front matching duration based on the front reduction coefficient and the duration of the second front spectrum segment, and compress the second front spectrum segment according to the front matching duration to generate a target front spectrum segment with a duration equal to the front matching duration.

[0067] S42. If the duration of the first front spectrum segment does not exceed the duration of the second front spectrum segment, then obtain the front amplification factor based on the front spectrum difference, obtain the front matching duration based on the front amplification factor and the duration of the second front spectrum segment, and amplify the second front spectrum segment according to the front matching duration to generate a target front spectrum segment with a duration equal to the front matching duration.

[0068] In this embodiment, it should be noted that in S41, when the original vocal range is longer than the user's (e.g., the user's premature singing causes the interval from the "P" initial consonant to the "Q" peak to be compressed), a pre-shortening coefficient is calculated based on the pre-spectral difference, and the user's initial range is spectrally compressed. This operation uses the spectral peak point as a rigid endpoint (e.g., the vowel resonance core of "Q"), only shrinking the time domain without changing the energy distribution structure: for example, in the song "PQR", if the user sings the "P" too quickly, the plosive phase of the initial consonant is shortened. S41 restores it to the original vocal range proportion through compression, while protecting the plosive phase (e.g., the 3-millisecond instantaneous energy pulse of / p / ) from being diluted, avoiding the breathiness distortion caused by traditional global stretching.

[0069] In S42, when the user's on-note duration is longer than the original vocal (e.g., a prolonged "K" initial consonant friction period), the spectral time domain is extended using a pre-amplification factor generated based on the difference. This extension is strictly limited to the on-note impact phase (from the "K" initial consonant to the "L" peak), and the duration is filled using an interpolation algorithm while maintaining the continuity of the spectral envelope. For example, if the user sings "KLM" with a prolonged "K," S42 accurately restores the original rhythm of the vocal cord closure noise, while the formant trajectory after the peak point (the "L" vowel core) remains undisturbed.

[0070] In one implementation, adaptively matching the duration of the second back spectrum segment based on the back spectrum difference and generating the target back spectrum segment in S4 includes:

[0071] S43. If the duration of the first back spectrum segment exceeds the duration of the second back spectrum segment, then obtain the back reduction coefficient based on the back spectrum difference, obtain the back matching duration based on the back reduction coefficient and the duration of the second back spectrum segment, and compress the second back spectrum segment according to the back matching duration to generate a target back spectrum segment with a duration equal to the back matching duration.

[0072] S44. If the duration of the first back spectrum segment does not exceed the duration of the second back spectrum segment, then obtain the back amplification factor based on the back spectrum difference, obtain the back matching duration based on the back amplification factor and the duration of the second back spectrum segment, and amplify the second back spectrum segment according to the back matching duration to generate a target back spectrum segment with a duration equal to the back matching duration.

[0073] In this embodiment, it should be noted that in S43, for cases where the duration of the user's trailing note exceeds that of the original vocals (e.g., a prolonged "R" attenuation due to a dragging beat), a reduction factor is calculated based on the difference in the subsequent spectrum, and the user's later segment is compressed starting from the peak point (e.g., from the peak of "Q" to the end point of "R"). The correction process maintains the physical laws of vibrato oscillation: in the song "PQR", when the user's "R" trailing note is prolonged, S43 proportionally compresses the amplitude attenuation curve but keeps the oscillation frequency unchanged (e.g., the original vocals' 4Hz vibrato remains 4Hz), preventing mechanical sine wave distortion caused by traditional compression.

[0074] In S44, when the user's trailing note is shorter than the original vocal (e.g., due to premature recording causing incomplete attenuation of "M"), an amplification factor is generated based on the difference, and time-domain extension is performed after the peak point of the spectrum. The extension process inherits the gradual logic of the original attenuation trajectory: for example, if the trailing note of "M" is too short when the user sings "KLM", S44 extends the attenuation period of the vocal cord tremor, naturally extending it by matching the slope of the original vocal amplitude envelope (e.g., -3dB / sec), avoiding abrupt interruptions in the trailing note caused by energy gaps.
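The time-domain compression and extension of a segment can be pictured with a simple resampling sketch. This is only an illustration of changing a segment's length while keeping its endpoints fixed (so the value at the peak point is untouched); the function name `resample_linear` and the choice of plain linear interpolation on a per-frame amplitude envelope are assumptions, not the interpolation algorithm of this disclosure.

```python
from typing import List


def resample_linear(frames: List[float], target_len: int) -> List[float]:
    """Stretch or compress a per-frame envelope to target_len frames by
    linear interpolation, keeping the first and last values fixed."""
    if target_len < 2 or len(frames) < 2:
        raise ValueError("need at least two source and two target frames")
    step = (len(frames) - 1) / (target_len - 1)
    out = []
    for k in range(target_len):
        pos = k * step                          # position on the source axis
        i = min(int(pos), len(frames) - 2)      # left neighbour index
        frac = pos - i                          # weight of the right neighbour
        out.append(frames[i] * (1 - frac) + frames[i + 1] * frac)
    return out


print(resample_linear([0.0, 1.0, 0.0], 5))            # extension
print(resample_linear([0.0, 1.0, 2.0, 3.0, 4.0], 3))  # compression
```

Plain resampling of this kind would also rescale any oscillation contained in the envelope, so on its own it does not keep the vibrato rate unchanged as the embodiments above require; a practical system would need a time-scale modification technique that preserves pitch and modulation rate.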

[0075] A song timbre matching system based on audio spectrum analysis is also provided, the system comprising:

[0076] The data acquisition module is used to acquire the original audio and user audio of the target song, acquire multiple song phrases corresponding to the target song, acquire the first matching spectrum segment corresponding to each song phrase based on the original audio, and acquire the second matching spectrum segment corresponding to each song phrase based on the user audio.

[0077] The data processing module is used to obtain the first spectral peak point of the first matching spectrum segment, and divide the first matching spectrum segment into a first front spectrum segment and a first rear spectrum segment according to the first spectral peak point, and obtain the second spectral peak point of the second matching spectrum segment, and divide the second matching spectrum segment into a second front spectrum segment and a second rear spectrum segment according to the second spectral peak point;

[0078] The data analysis module is used to obtain the front spectrum difference based on the first front spectrum segment and the second front spectrum segment, and to obtain the back spectrum difference based on the first back spectrum segment and the second back spectrum segment, and to determine whether the front spectrum difference and the back spectrum difference are both less than a preset threshold.

[0079] The matching generation module is used to adaptively match the duration of the second front spectrum segment based on the front spectrum difference and generate a target front spectrum segment if the difference between the front spectrum and the difference between the back spectrum are not both less than a preset threshold. It also adaptively matches the duration of the second back spectrum segment based on the back spectrum difference and generates a target back spectrum segment. The target front spectrum segment and the target back spectrum segment are used to construct the target sentence audio corresponding to the song sentence. Finally, the matched audio is obtained based on the target sentence audio corresponding to multiple song sentences.

[0080] In one implementation, the data acquisition module is further configured to: detect the fundamental frequency region in the first matching spectrum segment corresponding to the i-th song phrase; extend the boundary before and after the fundamental frequency region corresponding to the i-th song phrase until it extends to the spectrum transition point, and take the spectrum segment between the spectrum transition points before and after the fundamental frequency region as the first matching spectrum segment corresponding to the i-th song phrase.

[0081] In one embodiment, the data analysis module is further configured to: obtain the duration of the first front spectrum segment and obtain the duration of the second front spectrum segment; and use the absolute value of the difference between the duration of the first front spectrum segment and the duration of the second front spectrum segment as the front spectrum difference.

[0082] In one embodiment, the matching generation module is further configured to: if the duration of the first front spectrum segment exceeds the duration of the second front spectrum segment, obtain a front reduction factor based on the front spectrum difference, obtain a front matching duration based on the front reduction factor and the duration of the second front spectrum segment, and compress the second front spectrum segment according to the front matching duration to generate a target front spectrum segment with a duration equal to the front matching duration; if the duration of the first front spectrum segment does not exceed the duration of the second front spectrum segment, obtain a front amplification factor based on the front spectrum difference, obtain a front matching duration based on the front amplification factor and the duration of the second front spectrum segment, and amplify the second front spectrum segment according to the front matching duration to generate a target front spectrum segment with a duration equal to the front matching duration.

[0083] In this embodiment, it should be noted that the specific method of performing the above-mentioned song timbre matching system based on audio spectrum analysis has been described in detail in the embodiments of the song timbre matching method based on audio spectrum analysis, and will not be elaborated here.

[0084] The preferred embodiments of the present disclosure have been described in detail above with reference to the accompanying drawings. However, the present disclosure is not limited to the specific details of the above embodiments. Within the scope of the technical concept of the present disclosure, various simple modifications can be made to the technical solutions of the present disclosure, and these simple modifications all fall within the protection scope of the present disclosure.

[0085] It should also be noted that the combination of one or more letters “A, B, C, D, E, F, G, H, ..., Y, Z” described in the above specific embodiments can represent different song titles or single words, and the same combination of letters in different embodiments can represent different song titles or single words.

[0086] It should also be noted that the various specific technical features described in the above embodiments can be combined in any suitable manner without contradiction. To avoid unnecessary repetition, this disclosure will not describe the various possible combinations separately.

[0087] Furthermore, various different embodiments of this disclosure can be combined in any way, as long as they do not violate the spirit of this disclosure, they should also be regarded as the content disclosed in this disclosure.

[0088] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of the present invention, and they should all be covered within the scope of the claims and specification of the present invention.

Claims

1. A song timbre matching method based on audio spectrum analysis, characterized in that it includes: obtaining the original audio and user audio of the target song, obtaining multiple song phrases corresponding to the target song, obtaining the first matching spectrum segment corresponding to each song phrase based on the original audio, and obtaining the second matching spectrum segment corresponding to each song phrase based on the user audio; obtaining the first spectral peak point of the first matching spectrum segment, and dividing the first matching spectrum segment into a first front spectrum segment and a first back spectrum segment based on the first spectral peak point; obtaining the second spectral peak point of the second matching spectrum segment, and dividing the second matching spectrum segment into a second front spectrum segment and a second back spectrum segment based on the second spectral peak point; obtaining the front spectrum difference based on the first front spectrum segment and the second front spectrum segment, obtaining the back spectrum difference based on the first back spectrum segment and the second back spectrum segment, and determining whether the front spectrum difference and the back spectrum difference are both less than a preset threshold; if the front spectrum difference and the back spectrum difference are not both less than the preset threshold, adaptively matching the duration of the second front spectrum segment based on the front spectrum difference to generate a target front spectrum segment, adaptively matching the duration of the second back spectrum segment based on the back spectrum difference to generate a target back spectrum segment, constructing the target sentence audio corresponding to the song phrase based on the target front spectrum segment and the target back spectrum segment, and obtaining the matched audio based on the target sentence audio corresponding to the multiple song phrases.

2. The song timbre matching method based on audio spectrum analysis according to claim 1, characterized in that the step of obtaining the first matching spectrum segment corresponding to each song phrase based on the original audio includes: detecting the fundamental frequency region in the first matching spectrum segment corresponding to the i-th song phrase; extending the boundaries before and after the fundamental frequency region corresponding to the i-th song phrase until each reaches a spectral transition point; and taking the spectrum segment between the spectral transition points before and after the fundamental frequency region as the first matching spectrum segment corresponding to the i-th song phrase.
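The boundary extension in claim 2 can be illustrated with a minimal sketch. The claims do not say how a spectral transition point is detected, so this example assumes a hypothetical per-frame change measure (`flux`) and a hypothetical `threshold`: a frame whose value exceeds the threshold is treated as a transition point. The function name and both parameters are illustrative only.

```python
from typing import Sequence, Tuple


def extend_to_transitions(
    flux: Sequence[float], start: int, end: int, threshold: float
) -> Tuple[int, int]:
    """Grow the frame range [start, end] outward until a transition frame.

    Assumption: a frame counts as a spectral transition point when its
    flux value exceeds `threshold`. The returned range lies strictly
    between the nearest transition frames on either side.
    """
    left = start
    # Move the left boundary back while the previous frame is not a transition.
    while left > 0 and flux[left - 1] <= threshold:
        left -= 1
    right = end
    # Move the right boundary forward while the next frame is not a transition.
    while right < len(flux) - 1 and flux[right + 1] <= threshold:
        right += 1
    return left, right


# Hypothetical flux values; frames 1 and 6 act as transition points.
flux = [0.1, 0.9, 0.2, 0.1, 0.1, 0.3, 0.8, 0.1]
segment = extend_to_transitions(flux, start=3, end=4, threshold=0.5)
```

If no transition frame exists on one side, the range simply runs to the start or end of the available frames.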

3. The song timbre matching method based on audio spectrum analysis according to claim 1, characterized in that the step of obtaining the first spectral peak point of the first matching spectrum segment includes: obtaining the fundamental frequency region in the first matching spectrum segment, obtaining the point of maximum vibration energy in the fundamental frequency region, and taking the point of maximum vibration energy as the first spectral peak point.
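As a rough illustration of the peak selection in claim 3 and the front/rear division in claim 1, the sketch below works on a magnitude spectrogram stored as a list of frequency rows. Several things are assumptions rather than part of the claims: the fundamental frequency region is given as a fixed band `[f_lo, f_hi]`, "vibration energy" is taken as the sum of squared magnitudes in that band, and the peak frame itself is assigned to the rear segment.

```python
from typing import List, Sequence, Tuple


def spectral_peak_frame(
    mag: Sequence[Sequence[float]],
    freqs: Sequence[float],
    f_lo: float,
    f_hi: float,
) -> int:
    """Return the frame index with the most energy inside [f_lo, f_hi].

    `mag[b][t]` is the magnitude of frequency bin b at frame t, and
    `freqs[b]` is the centre frequency of bin b in Hz.
    """
    rows = [row for row, f in zip(mag, freqs) if f_lo <= f <= f_hi]
    if not rows:
        raise ValueError("no frequency bins fall inside the band")
    n_frames = len(rows[0])
    # Energy per frame: sum of squared magnitudes over the band (assumption).
    energy = [sum(row[t] ** 2 for row in rows) for t in range(n_frames)]
    return max(range(n_frames), key=energy.__getitem__)


def split_at_peak(
    mag: Sequence[Sequence[float]], peak: int
) -> Tuple[List[List[float]], List[List[float]]]:
    """Split into frames before the peak and frames from the peak onward."""
    front = [list(row[:peak]) for row in mag]
    rear = [list(row[peak:]) for row in mag]
    return front, rear


# Toy spectrogram: 4 bins x 5 frames; the 400 Hz bin lies outside the band.
mag = [
    [0.0, 0.0, 0.0, 0.0, 0.0],   # 100 Hz
    [0.0, 0.0, 0.0, 2.0, 0.0],   # 200 Hz
    [0.0, 0.0, 0.0, 0.0, 0.0],   # 300 Hz
    [10.0, 0.0, 0.0, 0.0, 0.0],  # 400 Hz
]
freqs = [100.0, 200.0, 300.0, 400.0]
peak = spectral_peak_frame(mag, freqs, f_lo=150.0, f_hi=350.0)
front, rear = split_at_peak(mag, peak)
```

The loud frame in the 400 Hz bin is ignored because it lies outside the assumed fundamental frequency band, which is the point of restricting the search to that region.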

4. The song timbre matching method based on audio spectrum analysis according to claim 1, characterized in that the step of obtaining the front spectrum difference based on the first front spectrum segment and the second front spectrum segment includes: obtaining the duration of the first front spectrum segment and the duration of the second front spectrum segment; and taking the absolute value of the difference between the duration of the first front spectrum segment and the duration of the second front spectrum segment as the front spectrum difference.
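The difference in claim 4 and the threshold test in claim 1 can be sketched as follows. Durations in seconds, the example threshold value, and all names are assumptions made for illustration. The rear spectrum difference is assumed to be computed the same way as the front one, which claim 4 itself does not state.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class SplitDurations:
    """Durations (assumed to be in seconds) on either side of the peak point."""
    front: float
    rear: float


def spectrum_differences(
    original: SplitDurations, user: SplitDurations
) -> Tuple[float, float]:
    """Absolute duration differences for the front and rear segments."""
    return abs(original.front - user.front), abs(original.rear - user.rear)


def needs_correction(
    original: SplitDurations, user: SplitDurations, threshold: float
) -> bool:
    """True unless BOTH differences are below the preset threshold."""
    front_diff, rear_diff = spectrum_differences(original, user)
    return not (front_diff < threshold and rear_diff < threshold)


original = SplitDurations(front=0.40, rear=1.20)
late_tail = SplitDurations(front=0.45, rear=1.50)  # rear deviates by about 0.30 s
close = SplitDurations(front=0.45, rear=1.25)      # both deviate by about 0.05 s

trigger_late = needs_correction(original, late_tail, threshold=0.1)
trigger_close = needs_correction(original, close, threshold=0.1)
```

Only the first user take triggers correction here: a single segment exceeding the tolerance is enough, while a take whose front and rear segments are both within tolerance is left untouched.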

5. The song timbre matching method based on audio spectrum analysis according to claim 1, characterized in that the step of adaptively matching the duration of the second front spectrum segment based on the front spectrum difference and generating the target front spectrum segment includes: if the duration of the first front spectrum segment exceeds the duration of the second front spectrum segment, obtaining a front reduction factor based on the front spectrum difference, obtaining a front matching duration based on the front reduction factor and the duration of the second front spectrum segment, and compressing the second front spectrum segment according to the front matching duration to generate a target front spectrum segment whose duration equals the front matching duration; and if the duration of the first front spectrum segment does not exceed the duration of the second front spectrum segment, obtaining a front amplification factor based on the front spectrum difference, obtaining a front matching duration based on the front amplification factor and the duration of the second front spectrum segment, and stretching the second front spectrum segment according to the front matching duration to generate a target front spectrum segment whose duration equals the front matching duration.
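Claim 5 changes the length of the second front spectrum segment so that it lasts exactly the front matching duration. The claims do not give the formula for the reduction or amplification factor, so the sketch below takes the target length as an input, expressed as a frame count (`target_frames`, a hypothetical parameter). Linear interpolation along the time axis is used here as a simple stand-in; the source does not specify the resampling technique.

```python
from typing import List, Sequence


def resample_frames(
    segment: Sequence[Sequence[float]], target_frames: int
) -> List[List[float]]:
    """Resample each frequency row of a segment to `target_frames` frames.

    Uses linear interpolation between neighbouring frames. This is an
    illustrative substitute, not the method described in the patent.
    """
    if target_frames < 1:
        raise ValueError("target_frames must be at least 1")
    out: List[List[float]] = []
    for row in segment:
        n = len(row)
        if n == 0:
            raise ValueError("segment rows must not be empty")
        if n == 1 or target_frames == 1:
            # Nothing to interpolate between: repeat the first value.
            out.append([float(row[0])] * target_frames)
            continue
        step = (n - 1) / (target_frames - 1)
        new_row = []
        for k in range(target_frames):
            pos = k * step
            i = min(int(pos), n - 2)  # left neighbour, clamped at the end
            frac = pos - i
            new_row.append(row[i] * (1 - frac) + row[i + 1] * frac)
        out.append(new_row)
    return out


stretched = resample_frames([[0.0, 2.0, 4.0]], target_frames=5)
compressed = resample_frames([[0.0, 1.0, 2.0, 3.0, 4.0]], target_frames=3)
```

The same routine covers both branches of the claim: a target frame count below the input length compresses the segment, and one above it lengthens the segment. The rear segment in claim 6 could be handled by the same call.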

6. The song timbre matching method based on audio spectrum analysis according to claim 1, characterized in that the step of adaptively matching the duration of the second rear spectrum segment based on the rear spectrum difference and generating the target rear spectrum segment includes: if the duration of the first rear spectrum segment exceeds the duration of the second rear spectrum segment, obtaining a rear reduction factor based on the rear spectrum difference, obtaining a rear matching duration based on the rear reduction factor and the duration of the second rear spectrum segment, and compressing the second rear spectrum segment according to the rear matching duration to generate a target rear spectrum segment whose duration equals the rear matching duration; and if the duration of the first rear spectrum segment does not exceed the duration of the second rear spectrum segment, obtaining a rear amplification factor based on the rear spectrum difference, obtaining a rear matching duration based on the rear amplification factor and the duration of the second rear spectrum segment, and stretching the second rear spectrum segment according to the rear matching duration to generate a target rear spectrum segment whose duration equals the rear matching duration.

7. A song timbre matching system based on audio spectrum analysis, characterized in that the system includes: a data acquisition module, used to acquire the original audio and the user audio of a target song, acquire multiple song phrases corresponding to the target song, acquire a first matching spectrum segment corresponding to each song phrase based on the original audio, and acquire a second matching spectrum segment corresponding to each song phrase based on the user audio; a data processing module, used to obtain a first spectral peak point of the first matching spectrum segment and divide the first matching spectrum segment into a first front spectrum segment and a first rear spectrum segment according to the first spectral peak point, and to obtain a second spectral peak point of the second matching spectrum segment and divide the second matching spectrum segment into a second front spectrum segment and a second rear spectrum segment according to the second spectral peak point; a data analysis module, used to obtain a front spectrum difference based on the first front spectrum segment and the second front spectrum segment, obtain a rear spectrum difference based on the first rear spectrum segment and the second rear spectrum segment, and determine whether the front spectrum difference and the rear spectrum difference are both less than a preset threshold; and a matching generation module, used to, if the front spectrum difference and the rear spectrum difference are not both less than the preset threshold, adaptively match the duration of the second front spectrum segment based on the front spectrum difference to generate a target front spectrum segment, adaptively match the duration of the second rear spectrum segment based on the rear spectrum difference to generate a target rear spectrum segment, construct the target phrase audio corresponding to the song phrase based on the target front spectrum segment and the target rear spectrum segment, and obtain the matched audio based on the target phrase audio corresponding to the multiple song phrases.

8. The song timbre matching system based on audio spectrum analysis according to claim 7, characterized in that the data acquisition module is also used to: detect the fundamental frequency region in the first matching spectrum segment corresponding to the i-th song phrase; extend the boundaries before and after the fundamental frequency region corresponding to the i-th song phrase until each reaches a spectral transition point; and take the spectrum segment between the spectral transition points before and after the fundamental frequency region as the first matching spectrum segment corresponding to the i-th song phrase.

9. The song timbre matching system based on audio spectrum analysis according to claim 7, characterized in that the data analysis module is also used to: obtain the duration of the first front spectrum segment and the duration of the second front spectrum segment; and take the absolute value of the difference between the duration of the first front spectrum segment and the duration of the second front spectrum segment as the front spectrum difference.

10. The song timbre matching system based on audio spectrum analysis according to claim 7, characterized in that the matching generation module is also used to: if the duration of the first front spectrum segment exceeds the duration of the second front spectrum segment, obtain a front reduction factor based on the front spectrum difference, obtain a front matching duration based on the front reduction factor and the duration of the second front spectrum segment, and compress the second front spectrum segment according to the front matching duration to generate a target front spectrum segment whose duration equals the front matching duration; and if the duration of the first front spectrum segment does not exceed the duration of the second front spectrum segment, obtain a front amplification factor based on the front spectrum difference, obtain a front matching duration based on the front amplification factor and the duration of the second front spectrum segment, and stretch the second front spectrum segment according to the front matching duration to generate a target front spectrum segment whose duration equals the front matching duration.