A speech watermarking encoding and decoding method, apparatus, device and medium

By lowering the pitch of the audio and embedding watermark signals in the resulting high-frequency hole region, the problem of high computing power consumption in existing technologies is solved, achieving low-cost and efficient voice watermark authentication that is suitable for intelligent assistants or customer service systems in the medical and financial fields.

CN120708628B · Active · Publication Date: 2026-06-23 · PING AN TECH (SHENZHEN) CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
PING AN TECH (SHENZHEN) CO LTD
Filing Date
2025-07-29
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing voice watermarking technology consumes a lot of computing power and is difficult to effectively identify the authenticity of generated voice, leading to the abuse of intelligent voice generation technology in illegal activities.

Method used

By down-pitching the audio, high-frequency hole regions are obtained, bandwidth is detected, watermark signals are generated, and watermark signals are embedded in the high-frequency hole regions. Decoding is performed using shared spectrum analysis, reducing frequency domain conversion and lowering computing power consumption.

Benefits of technology

It achieves low-computing-power voice watermark embedding and decoding, improves anti-counterfeiting efficiency, reduces computing costs, and is suitable for intelligent assistants or customer service systems in the medical and financial fields.

✦ Generated by Eureka AI based on patent content.


Abstract

The application relates to the technical field of artificial intelligence, is applicable to intelligent medical and financial scenarios, and discloses a voice watermark encoding and decoding method, apparatus, device and medium. The method comprises: performing pitch reduction processing on audio to obtain a high-frequency hole region of the down-pitched audio; detecting the bandwidth of the high-frequency hole region of the down-pitched audio; generating a watermark signal according to a preset encoded string; embedding the watermark signal in the high-frequency hole region of the down-pitched audio according to a preset watermark embedding rule to obtain and output watermarked audio; receiving the watermarked audio and extracting its high-frequency audio segment; and dividing the extracted high-frequency audio segment according to the number of frames corresponding to a single watermark character set in the preset watermark embedding rule to obtain character set unit frame groups, and decoding the watermark signal of each character set unit frame group. This reduces the computing power consumed by adding voice watermarks and lowers the computational cost.

Description

Technical Field

[0001] This invention relates to the fields of artificial intelligence technology and speech processing technology, and in particular to a speech watermarking encoding and decoding method, apparatus, device and medium.

Background Technology

[0002] Currently, with the rapid development of deepfake technology, the realism, naturalness, and similarity to the target person's voice of generated speech have greatly improved, reaching a level that is hard to distinguish from real speech. While providing convenience for intelligent interactive applications and devices, intelligent speech generation technology also poses threats to information cognition and social security. In recent years, intelligent speech generation software, mainly based on speech synthesis and voice conversion, has spread widely on the Internet, lowering the technical threshold and cost of speech production. This has led to its use by criminals for various illegal or fraudulent activities. For example, generated speech can be used to impersonate intelligent customer service for medical insurance in order to conduct fake medical insurance authentication for users, or to imitate intelligent assistants in financial scenarios in order to steal users' online banking personal information. Therefore, it is necessary to add watermarks to the speech generated by intelligent customer service or intelligent assistants for authentication. However, most existing speech watermarking technologies rely on feature extraction and adaptive modulation, which consume a lot of computing power.

Summary of the Invention

[0003] This invention provides a voice watermarking encoding and decoding method, apparatus, device, and medium to solve the technical problem of high computational consumption in existing voice watermarking techniques.

[0004] In a first aspect, a voice watermarking encoding and decoding method is provided, including:

[0005] The audio is down-pitched to obtain the high-frequency hole region of the down-pitched audio.

[0006] Detect the bandwidth of the high-frequency hole region in the down-pitched audio;

[0007] A watermark signal is generated based on a preset encoded string;

[0008] According to the preset watermark embedding rules, the watermark signal is embedded in the high-frequency hole region of the down-pitched audio to obtain the watermarked audio and output it.

[0009] Receive watermarked audio and extract the high-frequency audio segments of the watermarked audio.

[0010] Based on the number of frames corresponding to a single watermark character set according to the preset watermark embedding rules, the high-frequency audio segments of the extracted watermarked audio are divided to obtain character set unit frame groups, and the watermark signal is decoded for each character set unit frame group.

[0011] In a second aspect, a voice watermarking encoding and decoding apparatus is provided, comprising:

[0012] The audio pitch reduction module is used to reduce the pitch of audio and obtain the high-frequency hole region of the reduced audio.

[0013] The bandwidth detection module is used to detect the bandwidth of the high-frequency hole region in the down-pitched audio.

[0014] The watermark signal generation module is used to generate a watermark signal based on a preset encoded string;

[0015] The watermark embedding module is used to embed watermark signals in the high-frequency hole region of the down-pitched audio according to preset watermark embedding rules, so as to obtain watermarked audio and output it.

[0016] The high-frequency audio segment extraction module is used to receive watermarked audio and extract the high-frequency audio segments of the watermarked audio.

[0017] The watermark signal decoding module is used to divide the high-frequency audio segments of the extracted watermarked audio according to the number of frames corresponding to a single watermark character set based on the preset watermark embedding rules, to obtain character set unit frame groups, and to decode the watermark signal for each character set unit frame group.

[0018] In a third aspect, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the above-described voice watermark encoding and decoding method.

[0019] In a fourth aspect, a computer-readable storage medium is provided, which stores a computer program that, when executed by a processor, implements the steps of the above-described voice watermark encoding and decoding method.

[0020] In the scheme implemented by the above voice watermarking encoding and decoding method, apparatus, device, and medium, audio is received from a client and down-pitch processed to obtain the high-frequency hole region of the down-pitched audio; the bandwidth of the high-frequency hole region of the down-pitched audio is detected; a watermark signal is generated according to a preset encoded string; the watermark signal is embedded in the high-frequency hole region of the down-pitched audio according to a preset watermark embedding rule to obtain and output watermarked audio; the watermarked audio is received and its high-frequency audio segments are extracted; and the extracted high-frequency audio segments are divided according to the number of frames corresponding to a single watermark character set in the preset watermark embedding rule to obtain character set unit frame groups, and the watermark signal is decoded for each character set unit frame group. In this invention, the scheme can be used for intelligent assistants in medical insurance authentication procedures in medical business, or for intelligent customer service for voice-protected bank accounts in financial business, with the watermarked audio output for audio authentication. Using the high-frequency hole region left by pitch reduction as the embedding carrier eliminates the need for additional feature extraction and adaptive adjustment, resulting in low computing power consumption, reduced computational cost, and fast response. It also integrates pitch shifting and watermark encoding, deeply fusing pitch reduction processing with watermark embedding. On the decoding side, sharing spectrum analysis and signal processing reduces repeated frequency domain conversions and eliminates the need for complex time-frequency transformations, thus improving processing efficiency.

Description of the Attached Figures

[0021] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments of the present invention will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0022] Figure 1 is a schematic diagram of an application environment for a voice watermarking encoding and decoding method according to an embodiment of the present invention;

[0023] Figure 2 is a flowchart of a voice watermarking encoding and decoding method in one embodiment of the present invention;

[0024] Figure 3 is a schematic diagram of a specific implementation of step S40 in Figure 2;

[0025] Figure 4 is a schematic diagram of a voice watermark encoding and decoding device in one embodiment of the present invention;

[0026] Figure 5 is a schematic diagram of the structure of a computer device according to an embodiment of the present invention;

[0027] Figure 6 is another structural schematic diagram of a computer device according to one embodiment of the present invention.

Detailed Implementation

[0028] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0029] The voice watermarking encoding and decoding method provided in this invention can be applied in an application environment such as that shown in Figure 1. In fields such as healthcare and finance, intelligent assistants or intelligent customer service systems typically use a server-based architecture in which the client communicates with the server via a network. The server receives audio from the client and performs pitch reduction processing to obtain the high-frequency hole region of the down-pitched audio; detects the bandwidth of the high-frequency hole region of the down-pitched audio; generates a watermark signal according to a preset encoded string; embeds the watermark signal into the high-frequency hole region of the down-pitched audio according to preset watermark embedding rules, obtaining and outputting the watermarked audio; receives the watermarked audio and extracts its high-frequency audio segments; divides the extracted high-frequency audio segments according to the number of frames corresponding to a single watermark character set in the preset watermark embedding rules, obtaining character set unit frame groups; and decodes the watermark signal for each character set unit frame group. In this invention, the scheme can be used for intelligent assistants in medical insurance authentication procedures in healthcare services, or for intelligent customer service for voice-protected bank accounts in financial services, to obtain watermarked audio for audio authentication. Using the high-frequency hole region left by pitch reduction as the embedding carrier eliminates the need for additional feature extraction and adaptive adjustment, resulting in low computing power consumption, reduced computational cost, and fast response. It also integrates pitch shifting and watermark encoding, deeply fusing pitch reduction processing with watermark embedding. On the decoding side, sharing spectrum analysis and signal processing reduces repeated frequency domain conversions and eliminates the need for complex time-frequency transformations, improving processing efficiency. The client can be, but is not limited to, various personal computers, laptops, smartphones, tablets, and portable wearable devices. The server can be implemented as a standalone server or as a server cluster consisting of multiple servers. The invention is described in detail below through specific embodiments.

[0030] As shown in Figure 2, which is a flowchart of the voice watermarking encoding and decoding method provided in this embodiment of the invention, the method includes the following steps:

[0031] S10: Perform pitch reduction processing on the audio to obtain the high-frequency hole region of the down-pitched audio.

[0032] The voice watermarking encoding and decoding method provided by this invention can be applied to intelligent customer service or intelligent assistants in various application scenarios such as healthcare, finance, and insurance, and is typically implemented through a server. For example, in the medical insurance authentication process, an intelligent assistant allows users to verify the applicant's identity, confirm their application intention, or understand their health status through voice communication methods such as telephone authentication or remote consultation. The intelligent assistant generates a response audio based on the user's audio. Alternatively, in the financial field, for example, in intelligent customer service for voice-protected bank accounts, users can perform information setting and transaction operations on their bank accounts via voice. The intelligent customer service generates corresponding operation prompt audio based on the user's operation. After generating the response audio or operation prompt audio, the intelligent assistant or intelligent customer service can perform pitch reduction processing on the audio to obtain the high-frequency hole region of the reduced-pitch audio. This high-frequency hole region can then be used as the carrier for watermark embedding, eliminating the need for additional feature extraction and adaptive modulation processes, thus reducing computational power consumption and computational costs.

[0033] In step S10, which involves lowering the pitch of the audio, the pitch can be lowered by two semitones.

[0034] S20: Detect the bandwidth of the high-frequency hole region in the down-pitched audio.

[0035] Preferably, step S20, namely detecting the bandwidth of the high-frequency hole region of the down-pitched audio, specifically includes:

[0036] The bandwidth of the high-frequency hole region of the down-pitched audio is determined by detecting the audio sampling rate and down-pitch processing parameters.

[0037] The audio sampling rate can be 16 kHz, and the down-pitch processing parameter can be the down-pitch interval, which is 2 semitones in this embodiment. After down-pitch processing, a high-frequency hole region with a bandwidth of approximately 1 kHz is therefore obtained: holes appear in the spectrum of the down-pitched audio at approximately 7 kHz to 8 kHz. The high-frequency hole region contains almost no energy, so a watermark added there is effectively decoupled from the other regions. Detecting the bandwidth of the high-frequency hole region ensures its reliability.
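The figures above can be checked with a short calculation. The sketch below is illustrative only and rests on assumptions not stated in the patent: that the pitch shift keeps the sampling rate unchanged, that the original audio occupies the band up to the Nyquist frequency, and that one semitone corresponds to the equal-tempered ratio 2^(1/12). The function name `hole_region` is hypothetical.

```python
def hole_region(sample_rate_hz: float, semitones_down: float) -> tuple[float, float, float]:
    """Estimate (lower_edge_hz, upper_edge_hz, bandwidth_hz) of the empty band.

    Assumption: lowering the pitch by n semitones scales every frequency
    by 2 ** (-n / 12), so content that reached the Nyquist frequency now
    stops below it and leaves an empty band at the top of the spectrum.
    """
    nyquist_hz = sample_rate_hz / 2.0
    scale = 2.0 ** (-semitones_down / 12.0)
    lower_edge_hz = nyquist_hz * scale
    return lower_edge_hz, nyquist_hz, nyquist_hz - lower_edge_hz


low, high, width = hole_region(16_000, 2)
print(f"hole region: {low:.0f} Hz to {high:.0f} Hz, bandwidth {width:.0f} Hz")
```

Under these assumptions the hole region starts a little above 7.1 kHz and is roughly 0.87 kHz wide, which agrees with the approximate 7 kHz to 8 kHz range and the approximate 1 kHz bandwidth given in the embodiment.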

[0038] S30: Generate a watermark signal based on a preset encoded string.

[0039] The preset encoding string can be set according to actual needs, for example, "PINGAN".

[0040] S40: According to the preset watermark embedding rules, embed the watermark signal in the high-frequency hole region of the down-pitched audio to obtain the watermarked audio and output it.

[0041] In the embodiment of the invention, the preset watermark embedding rules in step S40 include a watermark character set, a short-time Fourier transform frame length, a short-time Fourier transform frame shift, and the number of frames corresponding to a single watermark character set. The number of frames corresponding to a single watermark character set represents the audio frame interval corresponding to a single watermark character in the watermark signal. The short-time Fourier transform frame length can be 32 ms, and the short-time Fourier transform frame shift can be 16 ms. By outputting watermarked audio, users can easily identify whether the output operation prompt audio or response audio comes from a legitimate intelligent customer service system for voice-protected bank accounts in the financial application field or from a legitimate intelligent assistant in a medical insurance authentication procedure in the medical application field.
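To make the frame parameters concrete, the following sketch converts the 32 ms frame length and 16 ms frame shift into sample counts and a frequency band count. It assumes the 16 kHz sampling rate from the embodiment and a one-sided spectrum with `frame_length / 2 + 1` bands; the function name `stft_parameters` and the dictionary keys are hypothetical.

```python
def stft_parameters(sample_rate_hz: int, frame_ms: int, shift_ms: int) -> dict:
    """Derive STFT sizes from millisecond settings (one-sided spectrum assumed)."""
    frame_length = sample_rate_hz * frame_ms // 1000   # samples per frame
    frame_shift = sample_rate_hz * shift_ms // 1000    # samples between frame starts
    band_count = frame_length // 2 + 1                 # bands from 0 Hz up to Nyquist
    band_spacing_hz = sample_rate_hz / frame_length    # frequency step between bands
    return {
        "frame_length": frame_length,
        "frame_shift": frame_shift,
        "band_count": band_count,
        "band_spacing_hz": band_spacing_hz,
    }


params = stft_parameters(16_000, 32, 16)
print(params)

# Lowest frequency covered by the ten highest bands (bands 248 to 257, 1-based).
lowest_of_top_ten_hz = (params["band_count"] - 10) * params["band_spacing_hz"]
print(f"top ten bands start at {lowest_of_top_ten_hz} Hz")
```

With these assumptions a frame holds 512 samples and yields 257 bands spaced 31.25 Hz apart, consistent with the 257 frequency bands stated in step S42. The ten highest bands then start at 7718.75 Hz, inside the approximate 7 kHz to 8 kHz hole region.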

[0042] Preferably, the watermark character set includes twenty-six uppercase English letters, ten Arabic numerals, and four special symbols, and the number of frames corresponding to a single watermark character set is four. The four special symbols can be a forward slash, an underscore, an asterisk, and a hash symbol, so the watermark character set includes forty characters, namely A to Z, 0 to 9, /, _, *, and #.

[0043] As shown in Figure 3, step S40, namely embedding the watermark signal in the high-frequency hole region of the down-pitched audio according to the preset watermark embedding rules, includes the following steps:

[0044] S41: Divide the watermark character set among frames according to the character order within the watermark character set and the number of frames corresponding to a single watermark character set, to obtain the range of watermark characters represented by each frame within the frames corresponding to a single watermark character set. When a single watermark character set corresponds to four frames, the first frame represents ABCDEFGHIJ, the second frame represents KLMNOPQRST, the third frame represents UVWXYZ0123, and the fourth frame represents 456789/_*#.

[0045] S42: Based on the frame length of the short-time Fourier transform, obtain the number of frequency bands in each audio frame, and number the frequency bands of each audio frame by frequency so that the band with the highest frequency has the largest index. With a short-time Fourier transform frame length of 32 ms, each audio frame has 257 frequency bands.

[0046] S43: Based on the range of watermark characters represented by each frame in the frame number corresponding to a single watermark character set, allocate characters to each audio frame in descending order of frequency band, and obtain the mapping relationship between the frequency band position and the watermark character in each frame in the frame number corresponding to a single watermark character set. Then, in the four frames corresponding to a single watermark character set, the watermark characters corresponding to the 257th to 248th frequency bands of the first audio frame are A to J, the watermark characters corresponding to the 257th to 248th frequency bands of the second audio frame are K to T, the watermark characters corresponding to the 257th to 248th frequency bands of the third audio frame are U to Z and 0 to 3, and the watermark characters corresponding to the 257th to 248th frequency bands of the fourth audio frame are 4 to 9, / , _, *, and #.

[0047] S44: Based on the mapping relationship between the frequency band position and the watermark character in each frame of the frames corresponding to a single watermark character set, and combined with the preset encoded string corresponding to the watermark signal, embed the watermark signal in the high-frequency hole region of the down-pitched audio. That is, each watermark character of the preset encoded string is embedded at the frequency band position of the audio frame that the mapping relationship assigns to it within the high-frequency hole region of the down-pitched audio. For example, when the preset encoded string is "PINGAN", the frequency band positions to be embedded are: the 252nd frequency band (P) of the second frame of the first single watermark character set, the 249th frequency band (I) of the first frame of the second single watermark character set, the 254th frequency band (N) of the second frame of the third single watermark character set, the 251st frequency band (G) of the first frame of the fourth single watermark character set, the 257th frequency band (A) of the first frame of the fifth single watermark character set, and the 254th frequency band (N) of the second frame of the sixth single watermark character set. Counted over the whole audio, these positions are the 252nd frequency band (P) of the second frame, the 249th frequency band (I) of the fifth frame, the 254th frequency band (N) of the tenth frame, the 251st frequency band (G) of the thirteenth frame, the 257th frequency band (A) of the seventeenth frame, and the 254th frequency band (N) of the twenty-second frame.
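The mapping in steps S41 to S44 can be sketched as a small lookup. This is a minimal illustration, not the patented implementation: it only computes where each character would be embedded, using 1-based frame and band numbers as in the text. The names `CHARSET`, `char_position`, and `embedding_positions` are hypothetical.

```python
import string

# Forty-character watermark set in the order given in the embodiment.
CHARSET = string.ascii_uppercase + string.digits + "/_*#"
FRAMES_PER_CHAR = 4                                  # frames per single watermark character set
BAND_COUNT = 257                                     # frequency bands per audio frame
CHARS_PER_FRAME = len(CHARSET) // FRAMES_PER_CHAR    # ten characters per frame


def char_position(ch: str) -> tuple[int, int]:
    """Return (frame within the group, band number), both 1-based."""
    index = CHARSET.index(ch)
    frame_in_group = index // CHARS_PER_FRAME + 1
    # The first character of each frame's range sits on the highest band (257).
    band = BAND_COUNT - index % CHARS_PER_FRAME
    return frame_in_group, band


def embedding_positions(code: str) -> list[tuple[str, int, int]]:
    """Return (character, overall frame number, band number) for each character."""
    positions = []
    for k, ch in enumerate(code):
        frame_in_group, band = char_position(ch)
        overall_frame = k * FRAMES_PER_CHAR + frame_in_group
        positions.append((ch, overall_frame, band))
    return positions


for ch, frame, band in embedding_positions("PINGAN"):
    print(f"{ch}: frame {frame}, band {band}")
```

For "PINGAN" this yields frames 2, 5, 10, 13, 17, and 22 with bands 252, 249, 254, 251, 257, and 254, matching the worked example in step S44.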

[0048] Preferably, when a visible watermark is required, i.e., when the watermark can be heard directly from the audio, the embedding energy can be increased; when a hidden watermark is required, i.e., when the watermark cannot be heard from the audio, the embedding energy can be decreased.

[0049] S50: Receive watermarked audio and extract the high-frequency audio segment of the watermarked audio.

[0050] In some embodiments of the invention, step S50, namely extracting the high-frequency audio segment of the watermarked audio, includes:

[0051] Based on the frame length and frame shift of the short-time Fourier transform of the watermarked audio according to the preset watermark embedding rules, a short-time Fourier transform is performed on the watermarked audio to obtain the audio amplitude spectrum.

[0052] In the audio amplitude spectrum, the amplitude spectrum corresponding to a preset number of frequency bands is extracted from each audio frame in descending order of frequency to obtain the high-frequency audio segment of the watermarked audio. The preset number is equal to the number of characters in the watermark character range represented by each frame. Since the watermark character set can include forty characters, and each watermark character set corresponds to four frames, the number of characters in the watermark character range represented by each frame is ten, hence the preset number is ten. Extracting the amplitude spectrum corresponding to the preset number of frequency bands from each audio frame in descending order of frequency in the audio amplitude spectrum is equivalent to extracting the amplitude spectrum corresponding to the ten highest frequency bands from each audio frame in descending order of frequency in the audio amplitude spectrum to obtain the high-frequency audio segment of the watermarked audio.
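The extraction step amounts to keeping the ten highest bands of every frame. The sketch below assumes the amplitude spectrum is already available as one list per frame, with bands ordered from lowest to highest frequency; computing the short-time Fourier transform itself is outside the sketch, and the function name `extract_high_frequency_segment` is hypothetical.

```python
def extract_high_frequency_segment(amplitude_spectrum, preset_number=10):
    """Keep the `preset_number` highest bands of each frame.

    `amplitude_spectrum` is a list of frames; each frame is a list of
    amplitudes ordered from the lowest band to the highest band.
    """
    return [frame[-preset_number:] for frame in amplitude_spectrum]


# Toy spectrum: four frames of 257 bands, each value equal to its band number.
spectrum = [[float(band) for band in range(1, 258)] for _ in range(4)]
segment = extract_high_frequency_segment(spectrum)
print(len(segment), len(segment[0]), segment[0][0], segment[0][-1])
```

In the toy spectrum each amplitude equals its 1-based band number, so the retained values run from 248.0 to 257.0, the same band range used for embedding.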

[0053] S60: Divide the high-frequency audio segments of the extracted watermarked audio into character set unit frame groups according to the frame number corresponding to a single watermark character set based on the preset watermark embedding rules, and decode the watermark signal for each character set unit frame group.

[0054] The watermark encoding and decoding share spectral analysis, which reduces repetitive frequency domain conversion operations and improves processing efficiency. Step S60, which involves decoding the watermark signal for each character set unit frame group, includes:

[0055] Calculate the frequency point with the highest energy in each character set unit frame group, and obtain the corresponding character according to the preset decoding mapping rules;

[0056] The characters corresponding to all the obtained character set unit frame groups are sequentially combined into a string.
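The two decoding steps above can be sketched as follows. The patent does not spell out the preset decoding mapping rules, so this sketch assumes they are the inverse of the embedding mapping from step S43. It operates on a high-frequency segment of ten bands per frame, ordered from lowest to highest. The helper `synthetic_high_band` only fabricates a clean test input and is not part of the described method; all names are hypothetical.

```python
import string

CHARSET = string.ascii_uppercase + string.digits + "/_*#"
FRAMES_PER_CHAR = 4
BANDS_PER_FRAME = 10


def decode(high_band):
    """Decode one character per group of FRAMES_PER_CHAR frames."""
    chars = []
    usable = len(high_band) - len(high_band) % FRAMES_PER_CHAR
    for start in range(0, usable, FRAMES_PER_CHAR):
        group = high_band[start:start + FRAMES_PER_CHAR]
        # Find the strongest frequency point in the whole group.
        _, frame_idx, pos = max(
            (amp, f, p) for f, frame in enumerate(group) for p, amp in enumerate(frame)
        )
        # Assumed inverse mapping: the highest band holds the first character of the frame.
        chars.append(CHARSET[frame_idx * BANDS_PER_FRAME + (BANDS_PER_FRAME - 1 - pos)])
    return "".join(chars)


def synthetic_high_band(code, peak=1.0, floor=0.01):
    """Build an idealized segment with one strong point per character (test input only)."""
    rows = []
    for ch in code:
        index = CHARSET.index(ch)
        group = [[floor] * BANDS_PER_FRAME for _ in range(FRAMES_PER_CHAR)]
        group[index // BANDS_PER_FRAME][BANDS_PER_FRAME - 1 - index % BANDS_PER_FRAME] = peak
        rows.extend(group)
    return rows


print(decode(synthetic_high_band("PINGAN")))  # PINGAN
```

Real audio would add noise and residual speech energy to the segment, so a practical decoder may also need a threshold or error checking, which the source does not describe.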

[0057] As can be seen, in the above solution, for intelligent assistants in medical insurance authentication procedures in medical services, or for intelligent customer service for voice-protected bank accounts in financial services, a voice watermarking encoding and decoding scheme can be used. This involves down-pitch processing of the audio to obtain the high-frequency hole region of the down-pitched audio, detecting the bandwidth of the high-frequency hole region, generating a watermark signal according to a preset encoded string, and embedding the watermark signal into the high-frequency hole region of the down-pitched audio according to preset watermark embedding rules. The resulting watermarked audio is then output for audio authentication. Using the high-frequency hole region left by down-pitch processing as the embedding carrier eliminates the need for additional feature extraction and adaptive adjustment, so the scheme consumes little computing power, reduces computational cost, and responds quickly. Moreover, it integrates pitch shifting and watermark encoding, deeply fusing pitch reduction processing with watermark embedding. On the decoding side, the watermarked audio is received and its high-frequency audio segments are extracted. According to the number of frames corresponding to a single watermark character set in the preset watermark embedding rules, the extracted high-frequency audio segments are divided to obtain character set unit frame groups, and watermark signal decoding is performed on each character set unit frame group. Sharing spectrum analysis and signal processing reduces repeated frequency domain conversions and eliminates the need for complex time-frequency transformations, thereby improving processing efficiency.

[0058] It should be understood that the sequence number of each step in the above embodiments does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.

[0059] In one embodiment, a voice watermarking encoding and decoding device is provided, which corresponds one-to-one with the voice watermarking encoding and decoding method described in the above embodiments. As shown in Figure 4, the voice watermark encoding and decoding device includes an audio pitch reduction module 101, a bandwidth detection module 102, a watermark signal generation module 103, a watermark embedding module 104, a high-frequency audio segment extraction module 105, and a watermark signal decoding module 106. Detailed descriptions of each functional module are as follows:

[0060] The audio pitch reduction module 101 is used to perform pitch reduction processing on the audio and obtain the high-frequency hole region of the down-pitch audio.

[0061] The bandwidth detection module 102 is used to detect the bandwidth of the high-frequency hole region of the down-pitched audio.

[0062] The watermark signal generation module 103 is used to generate a watermark signal according to a preset encoded string;

[0063] The watermark embedding module 104 is used to embed watermark signals in the high-frequency hole region of the down-pitched audio according to the preset watermark embedding rules, so as to obtain and output the watermarked audio.

[0064] The high-frequency audio segment extraction module 105 is used to receive watermarked audio and extract the high-frequency audio segment of the watermarked audio.

[0065] The watermark signal decoding module 106 is used to divide the high-frequency audio segments of the extracted watermarked audio according to the number of frames corresponding to a single watermark character set according to the preset watermark embedding rules, to obtain character set unit frame groups, and to decode the watermark signal for each character set unit frame group.

[0066] In one embodiment, the bandwidth detection module 102 is specifically used for:

[0067] The bandwidth of the high-frequency hole region of the down-pitched audio is determined by detecting the audio sampling rate and down-pitch processing parameters.

[0068] In one embodiment, the watermark embedding module 104 is specifically used for:

[0069] The watermark character set is divided into frames according to the character order in the watermark character set and the number of frames corresponding to a single watermark character set, so as to obtain the range of watermark characters represented by each frame in the number of frames corresponding to a single watermark character set.

[0070] Based on the frame length of the short-time Fourier transform, obtain the number of frequency bands in each audio frame, and sort the frequency bands of each audio frame in descending order of frequency, so that the highest frequency band has the largest corresponding number.

[0071] Based on the range of watermark characters represented by each frame in the frame number corresponding to a single watermark character set, the audio of each frame is assigned characters in descending order of frequency band, and the mapping relationship between the frequency band position of each frame in the frame number corresponding to a single watermark character set and the watermark characters is obtained.

[0072] Based on the mapping relationship between the frequency band position of each frame in the frame number corresponding to a single watermark character set and the watermark characters, and combined with the preset encoded string corresponding to the watermark signal, the watermark signal is embedded in the high-frequency hole region of the down-pitched audio.
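The mapping steps above can be illustrated with a short sketch. It uses the 40-character set and the four frames per character set given in claim 3, so each frame represents 10 characters. The frame length of 512, the function names `build_mapping` and `plan_embedding`, and the choice to give each character of the encoded string its own group of four frames are assumptions made for illustration, not details confirmed by the source.

```python
import string
from typing import Dict, List, Tuple

# 26 uppercase letters + 10 digits + 4 special symbols = 40 characters (claim 3).
CHARSET = string.ascii_uppercase + string.digits + "/_*#"
FRAMES_PER_SET = 4


def build_mapping(charset: str, frames_per_set: int, frame_length: int) -> Dict[str, Tuple[int, int]]:
    """Map each character to (frame index within the group, frequency band index).

    Bands are numbered so that the highest frequency has the largest index,
    and characters are assigned from the highest band downward.
    """
    if len(charset) % frames_per_set != 0:
        raise ValueError("character set must divide evenly across the frames")
    per_frame = len(charset) // frames_per_set
    n_bands = frame_length // 2 + 1  # bands of a real-input transform
    if per_frame > n_bands:
        raise ValueError("not enough frequency bands for the character range")
    mapping = {}
    for index, char in enumerate(charset):
        frame, offset = divmod(index, per_frame)
        mapping[char] = (frame, n_bands - 1 - offset)
    return mapping


def plan_embedding(text: str, mapping: Dict[str, Tuple[int, int]], frames_per_set: int) -> List[Tuple[int, int]]:
    """List the (absolute frame, band) positions to energize for `text`.

    Assumption: character number k of the string occupies frames
    k * frames_per_set .. k * frames_per_set + frames_per_set - 1.
    """
    plan = []
    for position, char in enumerate(text):
        frame, band = mapping[char]
        plan.append((position * frames_per_set + frame, band))
    return plan


mapping = build_mapping(CHARSET, FRAMES_PER_SET, frame_length=512)
print(mapping["A"], mapping["#"])                      # (0, 256) (3, 247)
print(plan_embedding("A0", mapping, FRAMES_PER_SET))   # [(0, 256), (6, 250)]
```

In this sketch only the top 10 bands of each frame carry watermark energy, so the detected bandwidth of the high-frequency hole region would need to cover at least those 10 bands.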

[0073] In one embodiment, the high-frequency audio segment extraction module 105 is specifically used for:

[0074] Using the frame length and frame shift of the short-time Fourier transform specified in the preset watermark embedding rules, a short-time Fourier transform is performed on the watermarked audio to obtain the audio amplitude spectrum.

[0075] In the audio amplitude spectrum, the amplitude spectrum corresponding to a preset number of frequency bands is extracted from each frame of audio in descending order of frequency to obtain the high-frequency audio segment of the watermarked audio. The preset number is equal to the number of characters in the watermark character range represented by each frame.
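The extraction described above can be sketched with NumPy as follows. The Hann window, the absence of padding, the frame length and frame shift of 512, and the synthetic test tone are all assumptions chosen for illustration; the source only specifies that a short-time Fourier transform is applied and that a preset number of the highest bands is kept from each frame. The function names are hypothetical.

```python
import numpy as np


def magnitude_stft(signal, frame_length, frame_shift):
    """Amplitude spectrum of shape (n_frames, frame_length // 2 + 1), no padding."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_length:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_length)  # window choice is an assumption
    n_frames = 1 + (len(signal) - frame_length) // frame_shift
    frames = np.stack([
        signal[i * frame_shift: i * frame_shift + frame_length] * window
        for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1))


def extract_high_band(magnitude, bands_per_frame):
    """Keep the top `bands_per_frame` bands of every frame, highest frequency first."""
    if bands_per_frame <= 0:
        raise ValueError("bands_per_frame must be positive")
    return magnitude[:, -bands_per_frame:][:, ::-1]


# Synthetic check: a tone centred on band 250 of a 512-sample frame.
FRAME_LENGTH, FRAME_SHIFT, BANDS_PER_FRAME = 512, 512, 10
t = np.arange(4 * FRAME_LENGTH)
tone = np.sin(2 * np.pi * 250 * t / FRAME_LENGTH)
high = extract_high_band(magnitude_stft(tone, FRAME_LENGTH, FRAME_SHIFT), BANDS_PER_FRAME)
print(high.shape)           # (4, 10)
print(high.argmax(axis=1))  # column 6 in every frame, i.e. band 256 - 6 = 250
```

Because only the amplitude spectrum of the top bands is kept, the decoder works on a small array per frame rather than on the full spectrum.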

[0076] In one embodiment, the watermark signal decoding module 106 is specifically used for:

[0077] Calculate the frequency point with the highest energy in each character set unit frame group, and obtain the corresponding character according to the preset decoding mapping rules;

[0078] The characters corresponding to all the obtained character set unit frame groups are sequentially combined into a string.
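A minimal sketch of this decoding step is given below. It assumes the high-frequency segment is available as a grid of amplitudes with one row per frame and one column per band, ordered from the highest frequency downward, and that the decoding mapping is the inverse of an embedding mapping in which frame f of a group represents characters f * 10 to f * 10 + 9 of the 40-character set from claim 3. The function name `decode_groups` is hypothetical.

```python
import string
from typing import Sequence

# 40-character set and four frames per character set, as in claim 3.
CHARSET = string.ascii_uppercase + string.digits + "/_*#"
FRAMES_PER_SET = 4


def decode_groups(high_band: Sequence[Sequence[float]],
                  charset: str = CHARSET,
                  frames_per_set: int = FRAMES_PER_SET) -> str:
    """Decode one character per group of `frames_per_set` frames.

    `high_band[frame][offset]` is the amplitude of the band that lies
    `offset` positions below the highest band (layout is an assumption).
    """
    per_frame = len(charset) // frames_per_set
    chars = []
    # Trailing frames that do not fill a whole group are ignored.
    for start in range(0, len(high_band) - frames_per_set + 1, frames_per_set):
        group = high_band[start:start + frames_per_set]
        # Frequency point with the highest energy in the whole group.
        best_frame, best_offset = max(
            ((f, o) for f in range(frames_per_set) for o in range(per_frame)),
            key=lambda pos: group[pos[0]][pos[1]],
        )
        chars.append(charset[best_frame * per_frame + best_offset])
    return "".join(chars)


# Two groups of four frames, ten bands each, with one energized point per group.
grid = [[0.0] * 10 for _ in range(8)]
grid[0][0] = 1.0   # group 0, frame 0, highest band  -> "A"
grid[6][6] = 1.0   # group 1, frame 2, offset 6      -> "0"
print(decode_groups(grid))  # A0
```

This sketch always returns a character for every group; a practical decoder would likely also need an energy threshold to reject groups that carry no watermark, which the source does not describe.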

[0079] This invention provides a voice watermark encoding and decoding device. On the encoding side, it performs pitch reduction processing on audio to obtain the high-frequency hole region of the down-pitched audio, detects the bandwidth of that region, generates a watermark signal based on a preset encoded string, and embeds the watermark signal into the high-frequency hole region according to preset watermark embedding rules, yielding watermarked audio that is output for audio authentication. Using the high-frequency hole region left by pitch reduction as the embedding carrier eliminates the need for additional feature extraction and adaptive adjustment, which lowers computational consumption and cost and gives a fast response; it also integrates pitch reduction and watermark encoding into a single process. On the decoding side, the device receives the watermarked audio, extracts its high-frequency audio segment, divides that segment into character set unit frame groups based on the number of frames corresponding to a single watermark character set in the preset watermark embedding rules, and decodes the watermark signal of each character set unit frame group. Because spectrum analysis and signal processing are shared, repeated frequency-domain conversions are reduced and complex time-frequency transformations are avoided, which improves processing efficiency.

[0080] Specific limitations regarding the voice watermarking encoding / decoding device can be found in the limitations of the voice watermarking encoding / decoding method described above, and will not be repeated here. Each module in the aforementioned voice watermarking encoding / decoding device can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device in hardware form, or stored in the memory of a computer device in software form, so that the processor can call and execute the corresponding operations of each module.

[0081] In one embodiment, a computer device is provided, which may be a server, and whose internal structure may be as shown in Figure 5. The computer device includes a processor, memory, network interface, and database connected via a system bus. The processor provides computing and control capabilities. The memory includes non-volatile and / or volatile storage media and internal memory. The non-volatile storage media stores the operating system, computer programs, and database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface is used to communicate with external clients via a network connection. When the computer program is executed by the processor, it implements the server-side functions or steps of a voice watermark encoding and decoding method.

[0082] In one embodiment, a computer device is provided, which may be a client, and whose internal structure may be as shown in Figure 6. The computer device includes a processor, memory, network interface, display screen, and input devices connected via a system bus. The processor provides computing and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system and computer programs. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface is used to communicate with an external server via a network connection. When executed by the processor, the computer program implements the client-side functions or steps of a voice watermark encoding and decoding method.

[0083] In one embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to perform the following steps:

[0084] The audio is down-pitched to obtain the high-frequency hole region of the down-pitched audio.

[0085] Detect the bandwidth of the high-frequency hole region in the down-pitched audio;

[0086] A watermark signal is generated based on a preset encoded string;

[0087] According to the preset watermark embedding rules, the watermark signal is embedded in the high-frequency hole region of the down-pitched audio to obtain the watermarked audio and output it.

[0088] Receive watermarked audio and extract the high-frequency audio segments of the watermarked audio.

[0089] Based on the number of frames corresponding to a single watermark character set according to the preset watermark embedding rules, the high-frequency audio segments of the extracted watermarked audio are divided to obtain character set unit frame groups, and the watermark signal is decoded for each character set unit frame group.

[0090] In one embodiment, a computer-readable storage medium is provided having a computer program stored thereon, the computer program performing the following steps when executed by a processor:

[0091] The audio is down-pitched to obtain the high-frequency hole region of the down-pitched audio.

[0092] Detect the bandwidth of the high-frequency hole region in the down-pitched audio;

[0093] A watermark signal is generated based on a preset encoded string;

[0094] According to the preset watermark embedding rules, the watermark signal is embedded in the high-frequency hole region of the down-pitched audio to obtain the watermarked audio and output it.

[0095] Receive watermarked audio and extract the high-frequency audio segments of the watermarked audio.

[0096] Based on the number of frames corresponding to a single watermark character set according to the preset watermark embedding rules, the high-frequency audio segments of the extracted watermarked audio are divided to obtain character set unit frame groups, and the watermark signal is decoded for each character set unit frame group.

[0097] It should be noted that, for the functions or steps that can be implemented by the computer-readable storage medium or computer device described above, reference may be made to the relevant server-side and client-side descriptions in the foregoing method embodiments. To avoid repetition, they are not described one by one here.

[0098] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, storage, databases, or other media used in the embodiments provided in this application can include non-volatile and / or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

[0099] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is used as an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above.

[0100] The above-described embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention, and should all be included within the protection scope of the present invention.

Claims

1. A voice watermark encoding and decoding method, characterized in that it includes:
performing pitch reduction processing on audio to obtain the high-frequency hole region of the down-pitched audio;
detecting the bandwidth of the high-frequency hole region of the down-pitched audio;
generating a watermark signal based on a preset encoded string;
embedding the watermark signal in the high-frequency hole region of the down-pitched audio according to preset watermark embedding rules, to obtain and output watermarked audio;
receiving the watermarked audio and extracting the high-frequency audio segment of the watermarked audio; and
dividing the extracted high-frequency audio segment of the watermarked audio based on the number of frames corresponding to a single watermark character set in the preset watermark embedding rules, to obtain character set unit frame groups, and decoding the watermark signal for each character set unit frame group;
wherein the preset watermark embedding rules include the watermark character set, the frame length of the short-time Fourier transform, the frame shift of the short-time Fourier transform, and the number of frames corresponding to a single watermark character set;
wherein embedding the watermark signal in the high-frequency hole region of the down-pitched audio according to the preset watermark embedding rules includes:
dividing the watermark character set into frames according to the character order in the watermark character set and the number of frames corresponding to a single watermark character set, to obtain the range of watermark characters represented by each frame in the number of frames corresponding to a single watermark character set;
obtaining the number of frequency bands in each audio frame based on the frame length of the short-time Fourier transform, and sorting the frequency bands of each audio frame in descending order of frequency, so that the highest frequency band has the largest corresponding number;
assigning characters to each audio frame in descending order of frequency band, based on the range of watermark characters represented by each frame in the number of frames corresponding to a single watermark character set, to obtain the mapping relationship between the frequency band position of each frame in the number of frames corresponding to a single watermark character set and the watermark characters; and
embedding the watermark signal in the high-frequency hole region of the down-pitched audio based on that mapping relationship, combined with the preset encoded string corresponding to the watermark signal;
and wherein extracting the high-frequency audio segment of the watermarked audio includes:
performing a short-time Fourier transform on the watermarked audio, using the frame length and frame shift of the short-time Fourier transform in the preset watermark embedding rules, to obtain the audio amplitude spectrum; and
extracting, from the audio amplitude spectrum, the amplitude spectrum corresponding to a preset number of frequency bands in each audio frame in descending order of frequency, to obtain the high-frequency audio segment of the watermarked audio, wherein the preset number is equal to the number of characters in the watermark character range represented by each frame.

2. The voice watermark encoding and decoding method as described in claim 1, characterized in that detecting the bandwidth of the high-frequency hole region of the down-pitched audio specifically includes: determining the bandwidth of the high-frequency hole region of the down-pitched audio by detecting the audio sampling rate and the down-pitch processing parameters.

3. The voice watermark encoding and decoding method as described in claim 1, characterized in that the watermark character set includes twenty-six uppercase English characters, ten Arabic numeral characters, and four special symbol characters; the number of frames corresponding to a single watermark character set is four; and the four special symbol characters are the forward slash, underscore, asterisk, and hash symbol.

4. The voice watermark encoding and decoding method as described in claim 1, characterized in that decoding the watermark signal for each character set unit frame group includes: calculating the frequency point with the highest energy in each character set unit frame group, and obtaining the corresponding character according to preset decoding mapping rules; and sequentially combining the characters obtained for all character set unit frame groups into a string.

5. A voice watermark encoding and decoding device, characterized in that it includes:
an audio pitch reduction module, used to perform pitch reduction processing on audio and obtain the high-frequency hole region of the down-pitched audio;
a bandwidth detection module, used to detect the bandwidth of the high-frequency hole region of the down-pitched audio;
a watermark signal generation module, used to generate a watermark signal based on a preset encoded string;
a watermark embedding module, used to embed the watermark signal in the high-frequency hole region of the down-pitched audio according to preset watermark embedding rules, to obtain and output watermarked audio;
a high-frequency audio segment extraction module, used to receive the watermarked audio and extract the high-frequency audio segment of the watermarked audio; and
a watermark signal decoding module, used to divide the extracted high-frequency audio segment of the watermarked audio based on the number of frames corresponding to a single watermark character set in the preset watermark embedding rules, to obtain character set unit frame groups, and to decode the watermark signal for each character set unit frame group;
wherein the watermark embedding module is specifically used to:
divide the watermark character set into frames according to the character order in the watermark character set and the number of frames corresponding to a single watermark character set, to obtain the range of watermark characters represented by each frame in the number of frames corresponding to a single watermark character set;
obtain the number of frequency bands of each audio frame according to the frame length of the short-time Fourier transform, and sort the frequency bands of each audio frame in descending order of frequency, so that the highest frequency band has the largest corresponding number;
assign characters to each audio frame in descending order of frequency band, according to the range of watermark characters represented by each frame in the number of frames corresponding to a single watermark character set, to obtain the mapping relationship between the frequency band position and the watermark characters in the number of frames corresponding to a single watermark character set; and
embed the watermark signal in the high-frequency hole region of the down-pitched audio according to that mapping relationship, combined with the preset encoded string corresponding to the watermark signal;
and wherein the high-frequency audio segment extraction module is specifically used to:
perform a short-time Fourier transform on the watermarked audio, using the frame length and frame shift of the short-time Fourier transform in the preset watermark embedding rules, to obtain the audio amplitude spectrum; and
extract, from the audio amplitude spectrum, the amplitude spectrum corresponding to a preset number of frequency bands in each audio frame in descending order of frequency, to obtain the high-frequency audio segment of the watermarked audio, wherein the preset number is equal to the number of characters in the watermark character range represented by each frame.

6. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the steps of the voice watermark encoding and decoding method as described in any one of claims 1 to 4.

7. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the steps of the voice watermark encoding and decoding method as described in any one of claims 1 to 4.