A highly covert audio forgery localization method and system based on hierarchical sequence labeling

By employing a hierarchical sequence labeling method that combines EAT, Conformer, Bi-LSTM, and CRF layers, the invention addresses the problem of locating highly covert audio forgeries driven by large language models. The method achieves accurate localization and structured prediction of local forgeries, improving the accuracy and stability of detection.

CN122245350A · Pending · Publication Date: 2026-06-19 · XIANGTAN UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
XIANGTAN UNIV
Filing Date
2026-05-25
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies lack effective localization capabilities when detecting and locating local deep forgery attacks driven by large language models. In particular, they have weak perception capabilities at splicing boundaries, and the prediction results are easily fragmented, making it difficult to capture the weak acoustic boundaries of highly covert audio forgeries and the tampering logic of specific language models.

Method used

A hierarchical sequence labeling-based approach is adopted, which extracts high-dimensional acoustic features through a pre-trained EAT model, combines Conformer and Bi-LSTM networks for temporal modeling, uses CRF layers for structured prediction, and introduces a boundary-aware hybrid loss function to optimize the model's localization accuracy of splicing boundaries.

Benefits of technology

It achieves precise localization of highly covert audio forgeries, improves localization accuracy and boundary clarity, effectively identifies extremely short semantically tampered segments and suppresses random noise, and has strong generalization ability and robust detection performance.


Abstract

This invention belongs to the field of artificial intelligence and cyberspace security technology, and specifically discloses a highly covert audio forgery localization method and system based on hierarchical sequence labeling. The method includes: extracting high-dimensional acoustic features sensitive to non-speech artifacts using a pre-trained EAT model; performing multi-scale temporal modeling through stacked Conformer modules and a bidirectional LSTM network to capture local acoustic correlations and long-range prosodic dependencies; using a CRF layer that combines emission scores and transition scores, decoded with the Viterbi algorithm, to obtain structurally consistent forgery label sequences; and introducing a boundary-aware hybrid loss function during model training to enhance the recognition of weak traces at splicing boundaries. This invention effectively solves the problem of fragmented predictions in traditional models, accurately locates local acoustic inconsistencies caused by highly covert tampering, and exhibits high robustness.

Description

Technical Field

[0001] This invention belongs to the field of artificial intelligence and cyberspace security technology, and in particular relates to a highly covert audio forgery localization method and system based on hierarchical sequence labeling.

Background Technology

[0002] With the rapid development of deep learning technology, generative artificial intelligence has made breakthrough progress in the fields of text-to-speech (TTS) and voice conversion. Existing zero-shot TTS systems can synthesize high-fidelity, high-similarity cloned speech with only a few seconds of reference audio.

[0003] However, technological advancements have also brought serious security risks. Traditional audio spoofing typically involves the synthesis of entire audio segments, and detection methods are mostly based on binary classification tasks, i.e., determining the authenticity of the entire audio segment. Recently, a more covert and dangerous "half-true, half-false" attack mode has gradually emerged. In this type of attack, malicious attackers use large language models to precisely tamper with key words in the speech (such as changing "agree" to "reject," or altering financial data), then use TTS synthesis technology to regenerate the tampered segments, and finally splice them seamlessly back into the original audio.

[0004] Existing technologies have significant shortcomings in dealing with highly covert local forgeries driven by large models, specifically including the following aspects: 1) Coarse detection granularity: Mainstream forgery detection models (such as RawNet2 and AASIST) are mainly designed for sentence-level binary classification, lacking frame-level localization capabilities and unable to pinpoint the specific location of the forgery. 2) Fragmented localization: The few methods that attempt frame-level detection typically employ a simple "feature extraction + classifier" architecture, ignoring the temporal dependencies between labels, resulting in "breakpoints" or "jumps" in the prediction results within consecutive forged segments, i.e., prediction fragmentation. 3) Weak boundary awareness: Existing training loss functions (such as standard cross-entropy) treat all frames in the sample equally, while the most crucial physical evidence of splicing attacks is often hidden at extremely short splicing boundaries (such as phase discontinuities or abrupt changes in background noise). Existing methods struggle to capture these subtle boundary features.

[0005] While research on general audio forgery detection exists in this field, these methods typically focus on general acoustic scenarios and lack modeling of subtle acoustic boundaries, phonological consistency, and language-model-specific tampering logic caused by highly concealed short-term splicing. Therefore, there is an urgent need for a highly concealed local audio forgery localization technique that can deeply understand long-term temporal dependencies, possess structured predictive capabilities, and accurately capture splicing boundaries.

[0006] Chinese invention patent application CN202310965261.6 discloses an audio forgery localization and detection method based on normalizing flows. The method extracts multi-scale features from the audio and maps these features to a normal distribution space using a normalizing flow. Outliers in the audio are evaluated through likelihood estimation of the latent space variables, and their gradients are used for forgery localization. This provides a clear theoretical basis for deep synthetic audio detection and enhances its interpretability. By extracting multi-scale features from the audio, that invention can capture both global and local nuances, improving the accuracy of audio forgery detection and localization. Furthermore, it requires only real audio samples for training, which significantly increases the generalization ability of forgery detection and gives it strong robustness. That invention also achieves forgery point localization by propagating the negative log-likelihood back to the input audio and locating forgery anomalies through the gradient of the audio signal.

[0007] That application aims to solve the problem of detecting deeply synthesized audio, explicitly stating that its goal is not only to "detect" but also to "locate" forged regions. It also employs deep neural networks for feature extraction and processing of the audio, using "multi-scale feature fusion" and "one-dimensional convolutions of different sizes" to obtain subtle global and local information. However, its feature extractor is a self-built multi-scale one-dimensional convolutional network, which is not sensitive enough to subtle editing traces. As for the depth of its context modeling, it adopts a normalizing flow, whose core logic is to map features to a standard normal distribution and determine whether a sample falls within a high-probability interval. It therefore focuses more on distributional anomalies of statistical features than on semantic or temporal logical breaks in the context. Furthermore, that application models localization as an anomaly attribution problem. It first calculates the anomaly score (likelihood) of the entire audio segment, and then computes the gradient by propagating the negative log-likelihood back to the input audio. The magnitude of the gradient is used to infer which frames contribute most to the anomaly, thereby locating the region. Because this localization relies entirely on gradient magnitude, it can easily produce fragmented and discontinuous localization results.

[0008] Chinese invention patent application CN202410486972.X provides a method for training a deepfake audio detection model, an electronic device, and a storage medium. The method for training a deepfake audio detection model includes: using a localization model to locate key frames of audio; using a generative, high-resolution model to reconstruct the key frames to obtain deepfake audio; and using the deepfake audio to train the model so that the model can simultaneously identify the deepfake audio and locate the deepfake portion.

[0009] That application also addresses the problem of locating local forgeries, employing a multi-task approach of "localization + detection." It utilizes "self-supervised EAT extraction of frame-level audio representation" and a single-layer bidirectional long short-term memory network to capture temporal information. However, its localization post-processing is median filtering, a simple signal processing technique used to remove "isolated noise predictions." It relies on non-learned, hard-coded rules and lacks any understanding of "semantic logic," mechanically smoothing out short impulses. For the extremely short words commonly found in LLM-driven forgery, median filtering is highly likely to misclassify them as noise and remove them, leading to missed detections. Furthermore, that application focuses on restoring general audio, such as "keyboard sounds" and "ambient sounds," which typically involves patching a gap with relatively smooth boundaries; it therefore does not focus on speech content forgery.

[0010] Therefore, there is a need in this field for a new method and system for locating highly covert audio forgery based on hierarchical sequence labeling.

Summary of the Invention

[0011] The localization method and system described in this invention are specifically applied to the detection and localization of local deep forgery attacks driven by large language models and aimed at altering the core content of speech.

[0012] The purpose of this invention is to address the shortcomings of existing audio forgery detection methods, which mainly focus on binary classification of complete audio and lack effective localization capabilities for "partially true, partially false" forgery attacks driven by large language models and involving only the alteration of local segments to change core content vocabulary. Furthermore, existing models suffer from weak perception at splicing boundaries and fragmented prediction sequences. This invention provides a method and system for local audio forgery localization based on hierarchical sequence annotation. By constructing a targeted dataset of locally altered forgeries and utilizing hierarchical temporal dependency modeling and boundary-aware optimization, this invention accurately locates subtle acoustic inconsistencies introduced by content alteration in audio.

[0013] This invention first provides a highly covert audio forgery localization method based on hierarchical sequence labeling, comprising the following steps: Step S1, Acoustic Feature Extraction: receive the input audio waveform, convert it into a Mel spectrogram, and extract a high-dimensional acoustic feature sequence using a pre-trained upstream feature extractor; the upstream feature extractor is a self-supervised pre-trained Efficient Audio Transformer model, i.e., the EAT model; Step S2, Temporal Dependency Modeling: input the high-dimensional acoustic feature sequence into a downstream temporal encoder, which comprises a serially connected Conformer encoder and a Bi-LSTM network, i.e., a bidirectional long short-term memory network; the stacked Conformer modules and the Bi-LSTM network model the local correlation, global contextual dependency, and long-range temporal structure of the high-dimensional acoustic feature sequence, generating a context-aware hidden state sequence $H_{\text{BiLSTM}}$; Step S3, Structured Sequence Prediction: map $H_{\text{BiLSTM}}$ to emission scores and apply a structured prediction layer, namely a conditional random field (CRF) layer; the CRF layer combines the emission scores with the transition scores between labels and uses the Viterbi algorithm to decode the optimal forgery label prediction sequence.

[0014] In this invention, when fraudsters use a large model to change keywords (such as names) in a sentence to generate fake audio with a similar timbre, this invention needs to determine the location of the forgery in the audio, but it does not require understanding the sentence itself. In this invention, if any part of the audio is identified as tampered with, then it is fake audio; if no part is identified as tampered with, then it is genuine audio.
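The decision rule in the preceding paragraph can be sketched as a small post-processing step over the decoded frame labels. This is a minimal illustration, not the patent's implementation: the label encoding (0 = REAL, 1 = FAKE), the function names, and the frame duration passed in are all assumptions made for the example.

```python
from typing import List, Tuple

REAL, FAKE = 0, 1  # hypothetical label encoding, assumed for this sketch


def labels_to_segments(labels: List[int], frame_sec: float) -> List[Tuple[float, float]]:
    """Merge runs of consecutive FAKE frames into (start_sec, end_sec) segments."""
    segments = []
    start = None
    for t, lab in enumerate(labels):
        if lab == FAKE and start is None:
            start = t                      # a forged run begins at frame t
        elif lab != FAKE and start is not None:
            segments.append((start * frame_sec, t * frame_sec))
            start = None                   # the forged run ended just before frame t
    if start is not None:                  # a forged run reaches the end of the audio
        segments.append((start * frame_sec, len(labels) * frame_sec))
    return segments


def is_fake_utterance(labels: List[int]) -> bool:
    """The audio is fake if any frame is labeled FAKE, genuine otherwise."""
    return any(lab == FAKE for lab in labels)


decoded = [0, 0, 1, 1, 1, 0, 0, 1]
print(labels_to_segments(decoded, frame_sec=0.25))
print(is_fake_utterance(decoded))
```

The utterance-level verdict falls out of the localization result, so no separate classifier head is needed for it in this sketch.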

[0015] In one specific implementation, step S1 specifically includes: converting the input audio waveform into a Mel spectrogram, inputting the Mel spectrogram into the EAT model, and extracting a high-dimensional acoustic feature sequence $H_{\text{EAT}}$. Furthermore, the parameters of the EAT model are frozen during model training.

[0016] In step S1 of this invention, the input audio waveform is received and converted into a Mel spectrogram using digital signal processing algorithms such as Short Time Fourier Transform (STFT).
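As a rough illustration of the Mel mapping behind this step, the sketch below computes Mel filter center frequencies and the number of STFT frames for a waveform. The HTK-style Mel formula, the 16 kHz sample rate, the 25 ms window, the 10 ms hop, and the 128 filters are illustrative assumptions; the patent text does not state these settings.

```python
import math
from typing import List


def hz_to_mel(freq_hz: float) -> float:
    """HTK-style Hz-to-Mel conversion (one common convention, assumed here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_mels: int, f_min: float, f_max: float) -> List[float]:
    """Center frequencies (Hz) of n_mels filters spaced evenly on the Mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


def num_frames(num_samples: int, win: int, hop: int) -> int:
    """Number of full STFT frames when no padding is applied."""
    if num_samples < win:
        return 0
    return 1 + (num_samples - win) // hop


# Assumed settings: 16 kHz audio, 25 ms window (400 samples), 10 ms hop (160 samples).
centers = mel_center_frequencies(n_mels=128, f_min=0.0, f_max=8000.0)
frames = num_frames(num_samples=16000, win=400, hop=160)  # 98 frames for one second
print(len(centers), frames)
```

The filters are dense at low frequencies and sparse at high frequencies, which is the property that distinguishes a Mel spectrogram from a plain STFT magnitude spectrogram.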

[0017] In step S1 of this invention, the EAT model serves as the feature extractor for the audio to be detected: it extracts a high-dimensional sequence of deep acoustic features, which the Conformer modules in step S2 then use for temporal modeling.

[0018] In one specific implementation, step S2 specifically includes: capturing global dependencies using the multi-head self-attention mechanism in the Conformer module, capturing local features using the convolution module, and outputting intermediate features $H_{\text{Conf}}$; the intermediate features $H_{\text{Conf}}$ are input to the Bi-LSTM network, which generates $H_{\text{BiLSTM}}$ through forward and backward recurrent processing.

[0019] In step S2 of this invention, the Conformer modules are used to extract global contextual dependency information from the acoustic feature sequence. Three Conformer modules are stacked.

[0020] In one specific implementation, step S3 specifically includes: projecting $H_{\text{BiLSTM}}$ through a linear layer to generate the emission score matrix S:

$$S = \mathrm{Linear}(H_{\text{BiLSTM}}) \in \mathbb{R}^{T \times K}$$

[0021] where Linear denotes a linear projection, $\mathbb{R}$ denotes the set of real numbers, T denotes the number of time frames, and K denotes the number of label categories;

[0022] Construct a scoring function score(X, Y) that includes the emission scores and the transition scores A between labels:

$$\mathrm{score}(X, Y) = \sum_{t=1}^{T} S_{t, y_t} + \sum_{t=1}^{T-1} A_{y_t, y_{t+1}}$$

where $\mathcal{Y}$ denotes the set of all possible label sequences; $S_{t, y_t}$ is the emission score the model assigns to candidate label $y_t$ at time frame t; and $A_{y_t, y_{t+1}}$ is the transition score from candidate label $y_t$ at time frame t to candidate label $y_{t+1}$ at time frame t+1. During the inference phase, the Viterbi algorithm is used to solve for the label sequence $Y^{*} = \arg\max_{Y \in \mathcal{Y}} \mathrm{score}(X, Y)$ that maximizes the scoring function, which is then used as the final localization result.
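The scoring function and Viterbi decoding described above can be sketched in a few lines of plain Python. This is a minimal illustration of standard linear-chain Viterbi decoding, not the patent's code: the function names, the two-label setup (0 = REAL, 1 = FAKE), and the numeric scores are made up for the example.

```python
from typing import List

Matrix = List[List[float]]


def sequence_score(S: Matrix, A: Matrix, Y: List[int]) -> float:
    """Sum of emission scores S[t][y_t] plus transition scores A[y_t][y_{t+1}]."""
    emit = sum(S[t][y] for t, y in enumerate(Y))
    trans = sum(A[Y[t]][Y[t + 1]] for t in range(len(Y) - 1))
    return emit + trans


def viterbi_decode(S: Matrix, A: Matrix) -> List[int]:
    """Return the label sequence with the highest sequence_score."""
    T, K = len(S), len(S[0])
    best = list(S[0])              # best[j]: top score of any path ending in label j
    backptr = []                   # backptr[t-1][j]: best previous label for j at frame t
    for t in range(1, T):
        new_best, ptr = [], []
        for j in range(K):
            i_star = max(range(K), key=lambda i: best[i] + A[i][j])
            new_best.append(best[i_star] + A[i_star][j] + S[t][j])
            ptr.append(i_star)
        best = new_best
        backptr.append(ptr)
    path = [max(range(K), key=lambda j: best[j])]
    for ptr in reversed(backptr):  # follow back-pointers from the last frame
        path.append(ptr[path[-1]])
    path.reverse()
    return path


# Illustrative transition matrix that rewards staying in the same label.
A = [[1.0, -1.0], [-1.0, 1.0]]
# A weak one-frame flip: frame-wise argmax would give [0, 0, 1, 0, 0].
weak = [[2.0, 0.0], [2.0, 0.0], [0.9, 1.1], [2.0, 0.0], [2.0, 0.0]]
# A strongly supported one-frame forgery.
strong = [[2.0, 0.0], [2.0, 0.0], [0.0, 6.0], [2.0, 0.0], [2.0, 0.0]]
print(viterbi_decode(weak, A))    # [0, 0, 0, 0, 0]
print(viterbi_decode(strong, A))  # [0, 0, 1, 0, 0]
```

With these toy scores the weakly supported flip is smoothed away while the strongly supported single-frame forgery survives, which is the behavior the structured prediction layer is meant to provide.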

[0023] In one specific implementation, the method further includes a model training step following step S1, wherein the model training step is optimized with a boundary-aware hybrid loss function $L_{\text{total}}$, which is defined in terms of the following quantities:

[0024] λ is the balance coefficient; $L_{\text{CRF}}$ is the negative log-likelihood loss of the CRF layer; $S_{\text{factor}}$ is the scaling factor; $w_t$ is the boundary-aware weight of time frame t; and $l_{ce}(S_t, y_t)$ is the cross-entropy loss for time frame t, where $S_t$ is the emission score vector corresponding to time frame t and $y_t$ is the ground-truth label corresponding to time frame t. The weight $w_t$ is calculated from the distance between time frame t and the nearest real/fake boundary $b_k$, where α is the enhancement intensity, σ is the Gaussian kernel width, k is the boundary index variable, and $b_k$ is the k-th real/fake boundary.
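The boundary-aware weight can be sketched as follows. The exact formula is not recoverable from this text, so the form used here, $w_t = 1 + \alpha \exp(-d_t^2 / (2\sigma^2))$ with $d_t = \min_k |t - b_k|$, is an assumption that is merely consistent with the symbol definitions above (α as enhancement intensity, σ as Gaussian kernel width). The default values of `alpha` and `sigma` are also made up.

```python
import math
from typing import List


def boundary_weights(num_frames: int, boundaries: List[int],
                     alpha: float = 2.0, sigma: float = 3.0) -> List[float]:
    """Per-frame weights that peak at real/fake boundaries and decay to 1 far away.

    ASSUMED form: w_t = 1 + alpha * exp(-d_t^2 / (2 * sigma^2)),
    where d_t is the distance (in frames) to the nearest boundary b_k.
    """
    weights = []
    for t in range(num_frames):
        if not boundaries:                 # fully genuine audio: uniform weights
            weights.append(1.0)
            continue
        d = min(abs(t - b) for b in boundaries)
        weights.append(1.0 + alpha * math.exp(-(d * d) / (2.0 * sigma * sigma)))
    return weights


# Forged span covering frames 4..7: boundaries at frame 4 (REAL->FAKE) and 8 (FAKE->REAL).
w = boundary_weights(num_frames=12, boundaries=[4, 8], alpha=2.0, sigma=1.0)
print([round(x, 3) for x in w])
```

Under this assumed form, frames far from any splice keep a weight close to 1, so the weighting reduces to ordinary cross-entropy away from boundaries.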

[0025] In one specific implementation, prior to step S1, a step of constructing highly concealed forged training data is included, specifically comprising: using a large language model to modify the content of the transcribed text of real speech, the modification strategies including negation modification, entity replacement, quantity modification, and detail injection; using a zero-shot speech synthesis model pool containing autoregressive and non-autoregressive architectures to synthesize forged speech segments based on the modified text and the original voiceprint; using forced alignment technology to obtain temporal boundaries, and using a smooth splicing algorithm to embed the forged segments into the original audio to generate training samples containing precise physical splicing forged boundaries.

[0026] This invention also provides a highly covert audio spoofing localization system based on hierarchical sequence labeling, comprising: an acoustic feature extraction module, used to perform step S1 as described above, receiving an input audio waveform and converting it into a high-dimensional acoustic feature sequence using a pre-trained upstream feature extractor; a temporal dependency modeling module, used to perform step S2 as described above, performing multi-scale modeling of the high-dimensional acoustic feature sequence through a Conformer module and a Bi-LSTM network to generate a context-aware hidden state sequence; and a structured sequence prediction module, used to perform step S3 as described above, using a CRF layer to decode the hidden state sequence using the Viterbi algorithm and outputting the optimal spoofing label prediction sequence.

[0027] In one specific implementation, the system further includes a model optimization module for performing the model training steps described above, calculating the boundary-aware hybrid loss, and updating the model parameters.

[0028] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of the method described above.

[0029] The present invention also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the method described above.

[0030] Compared with the existing technology CN202310965261.6, the present invention has higher positioning accuracy and clearer boundaries; the present invention has stronger detection capability for highly concealed forgeries; and the present invention has clearer training objectives and better convergence.

[0031] Compared with the existing technology CN202410486972.X, the localization mechanism of this invention is a form of structured prediction that combines the Bi-LSTM output, emission scores, a CRF layer, and Viterbi decoding. This invention can accurately retain extremely short but semantically meaningful tampered segments (such as a single inserted word) while suppressing genuinely random noise, which is a significant technical advantage over median filtering. Specifically, the temporal modeling of this invention uses Conformer combined with Bi-LSTM to form a complete chain of "local features - global attention - long-range temporal structure". This invention focuses on speech content tampering. Tampering with core speech content involves word replacement and insertion, which cause subtle breaks in prosody, intonation, and breath sounds; the Conformer + Bi-LSTM + CRF combination in this invention is designed precisely to capture such breaks in "linguistic logic". Compared with that application: 1. This invention addresses the problem that "extremely short semantic segments" are easily deleted by median filtering. This invention uses the CRF to locate single-word-level tampering, with a recall rate significantly higher than that of the prior application. In highly covert forgeries, the length of the forged segments varies enormously, from a single word to a whole sentence. Median filtering requires a fixed window size and cannot adapt to this variation: a large window misses short segments, while a small window leaves noise incompletely removed. In this invention, the CRF handles sequence dependencies adaptively, which resolves this problem. 2. The prior application is essentially still a local decision-making process, which is prone to fragmented results. In contrast, Viterbi decoding in this invention is a globally optimal path search, which ensures that the output label sequence has the highest probability over the entire utterance, thereby eliminating logically contradictory predictions. 3. This invention utilizes the "Macaron"-style feedforward network and the convolution module characteristic of the Conformer module, making it well suited to capturing subtle abrupt changes in audio waveforms (such as phase discontinuities at splicing points). Combined with the hard boundary constraints of the CRF, this makes the localized start and end times more accurate.
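The window-size trade-off of median filtering described above can be reproduced on a toy label sequence. The sketch below is illustrative only: the edge-padding choice, the window sizes, and the frame counts are assumptions, and it models the baseline post-processing rather than this invention's CRF decoding.

```python
from typing import List


def median_filter(labels: List[int], window: int) -> List[int]:
    """Sliding median over binary labels; edges are padded by repeating the end values."""
    if window % 2 != 1:
        raise ValueError("window must be odd")
    half = window // 2
    padded = [labels[0]] * half + list(labels) + [labels[-1]] * half
    return [sorted(padded[t:t + window])[half] for t in range(len(labels))]


# A genuine but very short forgery: 2 FAKE frames inside 14 frames of audio.
short_forgery = [0] * 6 + [1, 1] + [0] * 6
# A spurious single-frame prediction that should be treated as noise.
isolated_noise = [0, 0, 0, 1, 0, 0, 0]

print(median_filter(short_forgery, 5))   # large window: the forgery is erased
print(median_filter(short_forgery, 3))   # small window: the forgery survives
print(median_filter(isolated_noise, 3))  # small window: one-frame noise is removed
```

A fixed window separates "noise" from "forgery" purely by run length, so any noise burst as long as the shortest real forgery passes through unchanged; that is the limitation the learned transition scores are intended to avoid.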

[0032] In summary, the beneficial effects of this invention are as follows: This invention proposes a hierarchical (EAT + Conformer + Bi-LSTM + CRF) sequence annotation architecture specifically designed to solve the novel task of locating highly concealed audio tampering driven by large language models (LLM). By combining a high-quality pre-trained EAT encoder, a powerful temporal encoder, and a structured prediction layer, this invention achieves efficient and accurate localization of the acoustic boundaries of forged segments, achieving industry-leading performance, primarily in the following aspects.

[0033] (1) Highly targeted: This invention is specifically designed for audio forgery attacks with high concealment of content tampering driven by large language models. By constructing training data containing a variety of core content tampering strategies (such as negation and entity substitution), the model can learn the general acoustic features of such advanced forgery methods.

[0034] (2) High positioning accuracy: By introducing a boundary-aware hybrid loss function, the model’s sensitivity to weak acoustic inconsistencies (such as phase discontinuity and sudden changes in background noise) at the splicing boundary is significantly enhanced, effectively reducing the boundary positioning error.

[0035] (3) Consistent prediction structure: The CRF layer is used to impose structural constraints on the label sequence, which overcomes the defect of traditional frame-level classification methods that are prone to prediction fragmentation (such as the appearance of incorrect "real" jumps in the middle of fake segments), and ensures the consistency of the localization results in temporal logic.

[0036] (4) Excellent generalization ability: The pre-trained EAT model is used as a feature extractor, combined with multi-scale temporal modeling, so that the system does not rely on specific speaker features or single TTS artifacts, thus maintaining robust detection and localization performance when facing unseen speakers and attacks using different synthesis techniques.

Attached Figure Description

[0037] Figure 1 is a flowchart of the highly covert audio forgery localization method based on hierarchical sequence labeling provided by the present invention.

[0038] Figure 2 is a schematic diagram of the "partial tampering and forgery audio data construction process" that precedes step S1 in this invention and forms the "audio to be detected".

[0039] Figure 3 is a schematic diagram of the hierarchical sequence labeling model architecture of the present invention.

[0040] Figure 4 is a visual illustration of the localization result obtained by the method of this invention on a specific content-tampering sample.

Detailed Implementation

[0041] To enable those skilled in the art to better understand the technical solution of the present invention, the present invention will be further described in detail below with reference to the accompanying drawings.

[0042] The EAT, Conformer, Bi-LSTM, CRF, TTS, LLM, MFA and Viterbi used in this invention are all existing technologies known to those skilled in the art.

[0043] EAT: Efficient Audio Transformer.

[0044] Conformer: The Conformer module.

[0045] Bi-LSTM: Bidirectional Long Short-Term Memory Network.

[0046] CRF: Conditional Random Field.

[0047] TTS: Text-to-Speech.

[0048] LLM: Large Language Model.

[0049] MFA: Montreal Forced Aligner.

[0050] Viterbi: The Viterbi algorithm.

[0051] The technical solution to achieve the objective of this invention is:

[0052] A highly covert audio forgery localization method based on hierarchical sequence labeling, as shown in Figure 1, includes the following steps:

[0053] Step (1): Construction of Local Content Falsification Data and Preparation for Model Training: Construct an audio dataset containing multi-dimensional, highly concealed tampering types. Use a large language model to perform local content tampering on the transcribed text of real speech to generate adversarial text; use a zero-shot speech synthesis system to synthesize tampered segments, and use forced alignment technology to seamlessly splice them back into the original audio to form training data with precise timestamp annotations.

[0054] Step (2): Acoustic feature extraction: Receive the input audio waveform and use a pre-trained Efficient Audio Transformer (EAT) model as an upstream feature extractor to extract from the audio waveform a high-dimensional acoustic feature sequence that is highly sensitive to non-speech artifacts.

[0055] Step (3): Multi-scale temporal dependency modeling: The acoustic feature sequence is input into the downstream temporal encoder. First, local acoustic correlation and global contextual dependency are captured collaboratively through stacked Conformer modules; then, the long-range temporal structure and prosodic continuity of the entire sequence are modeled through a bidirectional long short-term memory network (Bi-LSTM) to generate a context-aware hidden state sequence.

[0056] Step (4): Structured Sequence Prediction: The hidden state sequence is mapped to an emission score matrix and input into a Conditional Random Field (CRF) layer. The CRF layer combines the emission scores with the transition scores between learnable labels and uses the Viterbi algorithm to decode and obtain a globally optimal, structurally consistent forged label prediction sequence.

[0057] Step (5): Optimization based on boundary-aware hybrid loss: During the model training phase, a boundary-aware mechanism is introduced to dynamically adjust the loss weights based on the distance between each time frame and the actual tampering boundary. The model is then jointly optimized by combining the CRF loss and the weighted cross-entropy loss to enhance the model's localization accuracy at splicing boundaries.

[0058] Further, step (1) includes the following sub-steps: (1.1) Select a real speech corpus and use a large language model (LLM) to perform local content tampering on the original transcribed text according to a preset strategy, the preset strategy including negation transformation, entity replacement, quantity modification, and detail injection; (1.2) Construct a zero-shot TTS model pool containing multiple architectures, and synthesize a fake speech segment that matches the surrounding acoustic context based on the tampered text and the original speaker's voiceprint features; (1.3) Use the Montreal Forced Aligner (MFA) to obtain millisecond-level phoneme boundaries for the original audio and the synthesized segment, and use a smooth splicing algorithm to embed the synthesized segment at the corresponding position of the original audio, generating an audio sample containing local content tampering together with the corresponding frame-level ground-truth label sequence.
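Sub-step (1.3) can be sketched as follows. The patent does not specify the smooth splicing algorithm, so the linear crossfade used here is an assumed stand-in, and the function names, the sample-index boundaries (which would come from forced alignment), the fade length, and the frame hop are all hypothetical. Labeling a frame by whether its center falls inside the forged span is likewise an assumption.

```python
from typing import List, Tuple


def crossfade_join(left: List[float], right: List[float], fade: int) -> List[float]:
    """Overlap the last `fade` samples of `left` with the first `fade` of `right`."""
    assert 0 < fade <= min(len(left), len(right))
    mixed = []
    for i in range(fade):
        g = (i + 1) / (fade + 1)                       # gain of the incoming signal
        mixed.append((1 - g) * left[len(left) - fade + i] + g * right[i])
    return list(left[:-fade]) + mixed + list(right[fade:])


def splice_and_label(original: List[float], fake: List[float], start: int, end: int,
                     fade: int, hop: int) -> Tuple[List[float], List[int]]:
    """Replace original[start:end] with `fake` and emit frame labels (1 = FAKE)."""
    assert fade <= start and fade <= len(original) - end and 2 * fade <= len(fake)
    prefix, suffix = original[:start], original[end:]
    audio = crossfade_join(crossfade_join(prefix, fake, fade), suffix, fade)
    # Samples influenced by the synthesized segment, including both crossfades.
    fake_start = start - fade
    fake_end = start + len(fake) - fade
    num_frames = len(audio) // hop
    labels = [1 if fake_start <= t * hop + hop // 2 < fake_end else 0
              for t in range(num_frames)]
    return audio, labels


original = [1.0] * 40          # stand-in for genuine speech samples
fake = [-1.0] * 20             # stand-in for a synthesized replacement segment
audio, labels = splice_and_label(original, fake, start=10, end=20, fade=2, hop=4)
print(len(audio), labels)
```

Because the replacement segment may be longer or shorter than the span it replaces, the labels are computed on the spliced audio rather than on the original timeline.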

[0059] Furthermore, step (2) includes the following sub-steps:

[0060] (2.1) Convert the original audio waveform W into a Mel spectrogram $X_{\text{spec}}$:

$$X_{\text{spec}} = \mathrm{MelSpectrogram}(W)$$

[0061] (2.2) Input the Mel spectrogram $X_{\text{spec}}$ into the pre-trained Efficient Audio Transformer (EAT) encoder to extract the high-dimensional acoustic feature sequence $H_{\text{EAT}}$:

$$H_{\text{EAT}} = \mathrm{EAT}(X_{\text{spec}}) \in \mathbb{R}^{T \times D_{\text{EAT}}}$$

[0062] where T is the number of frames in the feature sequence and $D_{\text{EAT}}$ is the feature dimension output by the EAT encoder. During training, the EAT encoder parameters are frozen and the encoder is used only as a feature extractor, to leverage the ability to capture environmental noise and subtle artifacts that it learned during self-supervised pre-training.

[0063] Furthermore, step (3) includes the following sub-steps:

[0064] (3.1) The high-dimensional acoustic feature sequence $H_{\text{EAT}}$ is input to an encoder composed of several stacked Conformer modules. These Conformer modules combine self-attention mechanisms, which capture global long-range dependencies, with convolution modules, which efficiently model local feature correlations. This process is represented as:

$$H_{\text{Conf}} = \mathrm{Conformer}(H_{\text{EAT}}) \in \mathbb{R}^{T \times D_{\text{Conf}}}$$

[0065] where $H_{\text{Conf}}$ denotes the output of the Conformer modules, i.e., the intermediate features, and $D_{\text{Conf}}$ denotes the feature dimension output by the Conformer modules.

[0066] (3.2) The output $H_{\text{Conf}}$ of the Conformer modules is passed to a multi-layer Bi-LSTM. The Bi-LSTM models the long-term temporal structure of the entire utterance by processing the sequence in both the forward and backward directions, generating the final context-aware hidden state $H_{\text{BiLSTM}}$:

$$H_{\text{BiLSTM}} = \mathrm{BiLSTM}(H_{\text{Conf}}) \in \mathbb{R}^{T \times D_{\text{BiLSTM}}}$$

[0067] where $D_{\text{BiLSTM}}$ denotes the feature dimension of the Bi-LSTM output.

[0068] Furthermore, step (4) includes the following sub-steps:

[0069] (4.1) The output $H_{\text{BiLSTM}}$ of the Bi-LSTM is passed through a linear layer to generate the "emission score" matrix S for the subsequent CRF layer:

$$S = \mathrm{Linear}(H_{\text{BiLSTM}}) \in \mathbb{R}^{T \times K}$$

[0070] where K is the number of label categories (e.g., REAL and FAKE).

[0071] (4.2) A CRF layer is introduced to model the dependencies between adjacent labels and impose structural constraints. Given a label sequence $Y = (y_1, y_2, \ldots, y_T)$, its score $\mathrm{score}(X_{\text{spec}}, Y)$ is defined as the sum of the emission scores from the network and the transition scores between adjacent labels:

$$\mathrm{score}(X_{\text{spec}}, Y) = \sum_{t=1}^{T} S_{t, y_t} + \sum_{t=1}^{T-1} A_{y_t, y_{t+1}}$$

[0072] where $A \in \mathbb{R}^{K \times K}$ is a learnable transition matrix and $A_{i,j}$ denotes the transition score from label i to label j.

[0073] (4.3) During inference, the Viterbi algorithm is used to efficiently decode the highest-scoring, i.e., most likely, label sequence:

$$Y^{*} = \arg\max_{Y' \in \mathcal{Y}} \mathrm{score}(X_{\text{spec}}, Y')$$

[0074] $Y^{*}$ is the final forgery localization result sequence, where $\mathcal{Y}$ denotes the set of all possible label sequences and $Y'$ denotes a candidate label sequence, i.e., the variable that ranges over $\mathcal{Y}$.

[0075] Furthermore, the calculation process of the boundary-aware hybrid loss in step (5) is as follows:

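[0076] (5.1) Given the input $X_{\text{spec}}$, the conditional probability $p(Y \mid X_{\text{spec}})$ of a label sequence Y is defined by applying softmax over the scores of all possible label sequences in $\mathcal{Y}$:

$$p(Y \mid X_{\text{spec}}) = \frac{\exp\left(\mathrm{score}(X_{\text{spec}}, Y)\right)}{\sum_{Y' \in \mathcal{Y}} \exp\left(\mathrm{score}(X_{\text{spec}}, Y')\right)}$$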

[0077] (5.2) For each time frame t, calculate its boundary-aware weight $w_t$, which assigns a Gaussian-decayed value according to the temporal distance between the frame and its nearest real/fake boundary $b_k$,

[0078] where α is the boundary enhancement intensity coefficient and σ is the Gaussian kernel width, which controls the extent of the region of interest;

[0079] (5.3) The boundary-aware hybrid loss function $L_{total}$ is constructed from the negative log-likelihood loss $L_{CRF}$ of the CRF and the boundary-weighted cross-entropy loss $L_{BCE\_Weighted}$:

[0080] Where λ is the balance coefficient, $S_{factor}$ is the scaling factor, and $l_{ce}$ is the standard cross-entropy loss function. Through the boundary-weighted cross-entropy loss $L_{BCE\_Weighted}$, this boundary-aware hybrid loss function forces the model to focus on high-entropy splicing boundary regions while optimizing global structural consistency.
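For orientation, one set of expressions consistent with the definitions in sub-steps (5.2) and (5.3) is sketched below. Three details are assumptions rather than statements of the text: the baseline weight of 1 for frames far from any boundary, the placement of λ on the weighted cross-entropy term, and the use of $S_{factor}$ as a divisor of the frame sum.

```latex
w_t = 1 + \alpha \exp\!\left( -\frac{\min_k \,(t - b_k)^2}{2\sigma^2} \right)

L_{BCE\_Weighted} = \frac{1}{S_{factor}} \sum_{t=1}^{T} w_t \, l_{ce}(S_t, y_t),
\qquad
L_{total} = L_{CRF} + \lambda \, L_{BCE\_Weighted}
```

Under these assumptions, a frame lying exactly on a boundary receives weight $1 + \alpha$, and the weight decays toward 1 within a few multiples of σ.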

[0081] Example 1

[0082] This embodiment describes a highly covert audio forgery localization method.

[0083] As shown in Figure 1, this embodiment provides a highly covert audio forgery localization method based on hierarchical sequence labeling. The method mainly includes five core steps: constructing highly covert local content forgery data, extracting acoustic features, multi-scale temporal dependency modeling, structured sequence prediction, and optimization training based on boundary-aware hybrid loss.

[0084] Step 1: Construction of highly concealed local content forgery data and preparation for model training.

[0085] In order to train the model to identify the physical splicing traces left by high-level content tampering driven by large models, this invention first constructs a dedicated local fake audio dataset (SegmentSpoof). Figure 2 illustrates the "local forged audio data construction process" that precedes step S1, "acoustic feature extraction," in Figure 1 and produces the "audio to be detected." It shows an automated pipeline comprising "original speech, LLM content manipulation, TTS model pool synthesis, MFA forced alignment and smooth splicing, and forged sample generation," which defines the source and construction logic of the highly concealed forged data. In LLM content manipulation, the transcript of the original speech is fed into a large language model (LLM) for modification. In TTS model pool synthesis, a voice template and a complete text segment are input into the TTS model, which generates audio for the complete text segment in the voice of the template. Using the process of Figure 2, this invention generates 85,424 fake audio samples, which are divided among the training, validation, and test sets as detailed in Table 1. Each fake sample corresponds to one of 85,424 real samples, so the total number of real and fake samples involved in this invention is 170,848.

[0086] As shown in Figure 2, the data construction process adopts an automated pipeline, and the specific implementation details are as follows.

[0087] 1.1 Corpus preprocessing and content manipulation.

[0088] The high-fidelity Chinese speech corpus AISHELL-3 was selected as the source data. Audio segments between 1.5 and 8.0 seconds long were selected, uniformly resampled to 16 kHz, and loudness-normalized (e.g., to -23 LUFS) to eliminate the bias caused by volume differences.

[0089] A large language model (such as Qwen-3.5-Max) is used as the content attack agent. Specific prompts are designed to modify the original transcribed text using the following four strategies while maintaining the original speaker's tone and syntactic fluency.

[0090] Negation: Reversing the polarity of the meaning of a sentence; for example, changing "I agree" to "I disagree".

[0091] Entity Substitution: Replace key names of people, places, or organizations.

[0092] Quantitative Manipulation: Modifies specific numerical or time information.

[0093] Detail Injection: Adds fictitious contextual details.
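The four strategies above could be wired into a prompt builder along the following lines. This is only a sketch: the prompt wording, the `STRATEGIES` table, and the `build_prompt` function are hypothetical and are not the prompts used by the invention, and the call to the language model itself is omitted.

```python
# Hypothetical instruction text for each of the four tampering strategies.
STRATEGIES = {
    "negation": "Reverse the polarity of the sentence's meaning.",
    "entity_substitution": "Replace a key person, place, or organization name.",
    "quantitative_manipulation": "Change a specific number or time expression.",
    "detail_injection": "Add a fictitious contextual detail.",
}


def build_prompt(transcript, strategy):
    """Compose an editing instruction for one transcript and one strategy."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")
    return (
        "Edit the transcript below by making exactly one local change.\n"
        f"Strategy: {STRATEGIES[strategy]}\n"
        "Keep the speaker's tone and keep the sentence fluent.\n"
        f"Transcript: {transcript}"
    )


print(build_prompt("I agree", "negation"))
```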

[0094] 1.2 Diverse zero-shot speech synthesis.

[0095] To simulate the various tools that attackers might use in reality, this invention constructs a TTS model pool that includes autoregressive (AR) and non-autoregressive (NAR) architectures. Specifically, this includes, but is not limited to, advanced zero-shot speech synthesis models such as IndexTTS, CosyVoice, F5-TTS, and Spark-TTS.

[0096] The system performs zero-shot voice-cloning synthesis from the tampered text fragment, using the original audio as the voiceprint reference. This multi-model strategy ensures that the generated forged fragments contain diverse vocoder artifacts, preventing the detection model from overfitting to the artifacts of a single synthesizer.

[0097] 1.3 High-fidelity audio re-integration.

[0098] The Montreal Forced Aligner (MFA) is used to perform millisecond-level phoneme alignment between the original and synthesized audio, precisely determining the start and end points of the tampered words on the timeline. To simulate high-level attacks and increase the difficulty of detection, this invention employs multiple splicing techniques to embed the synthesized segments back into the original audio. Zero-Crossing Cut: a hard cut is made at a position where the waveform amplitude is zero, which leaves a certain discontinuity. Crossfade: an overlapping window is applied at the splicing boundary for a smooth transition, masking the pops caused by phase mismatch. Overlap-Add: the spectral discontinuity is smoothed through temporal superposition. Through the above steps, an audio sample $X$ containing locally forged content and a corresponding frame-level ground-truth label sequence $Y_{true}$ are generated (where 0 represents a real frame and 1 represents a fake frame).
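The crossfade variant of the re-integration step, together with the derivation of frame-level labels, can be sketched as below. This is an illustration under stated assumptions: audio is a plain list of float samples, the fade is linear, the sample indices `start` and `end` are taken as already known from forced alignment, and the frame length is a hypothetical parameter that this text does not specify.

```python
def crossfade_splice(original, synthetic, start, end, fade):
    """Replace original[start:end] with `synthetic`, blending `fade` samples per joint."""
    if fade < 0 or len(synthetic) < 2 * fade:
        raise ValueError("synthetic segment is too short for the fade length")
    if start + fade > len(original) or end - fade < 0 or start > end:
        raise ValueError("fade window does not fit inside the original audio")
    out = list(original[:start]) + list(synthetic) + list(original[end:])
    for i in range(fade):
        g = (i + 1) / (fade + 1)  # linear fade-in gain, strictly between 0 and 1
        # Left joint: the original fades out while the synthetic segment fades in.
        out[start + i] = (1 - g) * original[start + i] + g * synthetic[i]
        # Right joint: the synthetic segment fades out while the original fades in.
        n = len(synthetic) - fade + i
        out[start + n] = (1 - g) * synthetic[n] + g * original[end - fade + i]
    return out


def frame_labels(num_samples, fake_start, fake_end, frame_len):
    """Label a frame 1 (fake) if it overlaps [fake_start, fake_end), else 0 (real)."""
    num_frames = (num_samples + frame_len - 1) // frame_len
    labels = []
    for f in range(num_frames):
        lo = f * frame_len
        hi = min(lo + frame_len, num_samples)
        labels.append(1 if lo < fake_end and hi > fake_start else 0)
    return labels


# Toy signal: constant 1.0 "real" audio with a constant 0.0 "synthetic" insert.
original = [1.0] * 20
synthetic = [0.0] * 8
spliced = crossfade_splice(original, synthetic, start=6, end=12, fade=2)
labels = frame_labels(len(spliced), fake_start=6, fake_end=6 + len(synthetic), frame_len=4)
print(labels)  # [0, 1, 1, 1, 0, 0]
```

Marking every frame that overlaps the inserted region as fake is one possible labeling convention; a stricter rule, such as majority overlap, would shift the boundary frames.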

[0099] 1.4. SegmentSpoof Dataset Metadata Display

[0100] As shown in Table 1, the specific statistical characteristics of this dataset are as follows.

[0101] Table 1

[0102] The dataset contains a total of 170,848 audio samples, with a total duration of approximately 160 hours. The ratio of genuine audio samples to forged samples containing partial content manipulation is strictly balanced at 1:1 (85,424 samples each), effectively preventing the model from overfitting to any one category.

[0103] It covers four advanced zero-shot speech synthesis systems with different architectures (CosyVoice, F5-TTS, Spark-TTS, and IndexTTS), ensuring that the model learns general synthesis artifacts rather than the features of a single model. It also covers four highly covert local content tampering strategies (negation, entity substitution, quantitative manipulation, and detail injection), with negation accounting for the highest proportion, ensuring coverage of high-risk content reversal attacks.

[0104] For scientific evaluation, the dataset was divided into a training set (74,718 entries), a validation set (49,812 entries), and a test set (46,318 entries). An "unseen-speaker" subset was also included in the test set to evaluate the model's cross-speaker generalization ability.
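The dataset figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in this section; the mean duration is derived from the approximate 160-hour total and is therefore itself approximate.

```python
real_samples = 85_424
fake_samples = 85_424
splits = {"train": 74_718, "validation": 49_812, "test": 46_318}

total = real_samples + fake_samples
# The three splits should add up to the stated total of 170,848 samples.
assert sum(splits.values()) == total == 170_848

# Share of each split and the implied mean clip duration in seconds.
shares = {name: count / total for name, count in splits.items()}
mean_duration_s = 160 * 3600 / total

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"mean duration: {mean_duration_s:.2f} s")
```

The splits work out to roughly 43.7%, 29.2%, and 27.1% of the data, and the implied mean clip length of about 3.4 seconds falls inside the 1.5 to 8.0 second selection window.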

[0105] Figure 3 is a schematic diagram of the hierarchical sequence labeling model architecture of the present invention. It shows how the system components are connected: the input audio is segmented and embedded by a CNN and sent to the EAT encoder; the extracted features pass sequentially through the stacked Conformer modules (i.e., a three-layer Conformer) and the bidirectional LSTM network; finally, the emission scores are output by the linear layer and decoded by the CRF layer to obtain the predicted label sequence, i.e., the 0/1 sequence.

[0106] Step 2: Acoustic feature extraction.

[0107] This step corresponds to S1 in Figure 1. In a specific embodiment of the present invention, the system first receives the original audio waveform $W$.

[0108] 2.1 Waveform to Spectrum Conversion: The original audio waveform $W$ is first converted into a Mel spectrogram $X_{spec}$.

[0109] 2.2 Upstream Feature Extractor: To obtain a highly discriminative acoustic representation, this invention preferably uses an Efficient Audio Transformer (EAT) model pre-trained through self-supervised learning as the upstream feature extractor.

[0110] The advantage of this choice is that, compared to encoders such as HuBERT or WavLM, which are primarily designed for modeling speech and language content, the EAT model performs best in forgery localization tasks. As a feature extractor optimized for non-speech signals, EAT is more sensitive than ASR encoders optimized for phoneme recognition (such as WavLM) to phase discontinuities and abrupt changes in ambient noise at splicing boundaries. The feature representations it generates respond more strongly to "boundary features" such as non-speech artifacts and acoustic inconsistencies in background noise, which are precisely the most crucial physical evidence of content forgery driven by large language models (LLMs), especially near splicing points.

[0111] 2.3 Feature Extraction Process: The Mel spectrogram $X_{spec}$ is fed into the EAT model to extract the high-dimensional acoustic feature sequence $H_{EAT}$, whose dimension is $T \times D_{EAT}$, where T is the number of time frames and $D_{EAT}$ is the dimensionality of the output features of the EAT model. In a preferred embodiment, the weights of the EAT model are frozen during model training, so that it serves as a fixed, efficient feature extractor, thereby reducing the training overhead for downstream tasks.

[0112] Step 3: Temporal dependency modeling.

[0113] This step corresponds to S2 in Figure 1. The high-dimensional acoustic feature sequence $H_{EAT}$ extracted upstream is fed into the downstream temporal encoder.

[0114] 3.1 Conformer Module: In this embodiment, the encoder begins with a stack of 3 Conformer modules.

[0115] The advantage of this choice is that the Conformer architecture combines self-attention and convolution. The self-attention module captures global, long-range contextual dependencies in the feature sequence, while the convolution module efficiently models local feature correlations (such as acoustic variations between adjacent frames). This dual mechanism is crucial for this task because it enables the model to simultaneously identify local artifacts at splice points and perceive the "contextual inconsistencies" in prosody and rhythm between the forged segments and the surrounding real speech.

[0116] 3.2 Bi-LSTM Network: The output $H_{Conf}$ of the Conformer module is then passed to a multi-layer bidirectional long short-term memory network.

[0117] The advantage of this choice is that, while the Conformer can already capture context, the Bi-LSTM further models the long-term temporal structure of the entire utterance. By processing the sequence in both the forward and backward directions, the Bi-LSTM generates hidden states that encode global rhythm and intonation patterns, which is crucial for determining whether an inserted forged segment is "reasonable".

[0118] 3.3 State Output: The Bi-LSTM network ultimately generates a context-aware hidden state sequence $H_{BiLSTM}$.

[0119] Step 4: Structured sequence prediction.

[0120] This step corresponds to S3 in Figure 1.

[0121] 4.1 Emission Score Generation: The output $H_{BiLSTM}$ of the Bi-LSTM is first passed through a standard fully connected linear layer to generate the emission score matrix $S$. The matrix $S$ has dimension $T \times K$, where K is the number of label categories (in this embodiment, K=2, i.e., "REAL" and "FAKE").

[0122] 4.2 Conditional Random Field Layer: To overcome the defects of traditional softmax layers in predicting independent frames, which may produce physically illogical and highly fragmented tag sequences (e.g., "FAKE-REAL-FAKE"), this invention introduces a CRF layer as the final structured prediction layer.

[0123] The advantage of this choice: the core advantage of the CRF layer lies in its ability to model the dependencies between adjacent labels and enforce structural constraints. The CRF layer jointly considers the emission scores from the Bi-LSTM and a transition score matrix $A$, learned during training, whose entry $A_{i,j}$ represents the plausibility of a transition between labels (for example, the transition score from "REAL" to "FAKE").

[0124] 4.3 Sequence Decoding: During the inference phase, the system employs the Viterbi algorithm to efficiently decode, from all possible label paths, the optimal label sequence with the highest combined score (emission score + transition score). This sequence is the final, structurally coherent forgery localization result.
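A decoded 0/1 label sequence is usually reported as time intervals. The conversion can be sketched as follows; the function name and the `frame_shift_s` parameter (seconds per frame) are hypothetical, since this text does not state the frame rate of the model output.

```python
def labels_to_intervals(labels, frame_shift_s):
    """Merge runs of FAKE (1) frames into (start_seconds, end_seconds) intervals."""
    intervals = []
    run_start = None
    for t, y in enumerate(labels):
        if y == 1 and run_start is None:
            run_start = t  # a forged run begins at frame t
        elif y != 1 and run_start is not None:
            intervals.append((run_start * frame_shift_s, t * frame_shift_s))
            run_start = None
    if run_start is not None:  # the run extends to the end of the utterance
        intervals.append((run_start * frame_shift_s, len(labels) * frame_shift_s))
    return intervals


# With one frame per second, frames 1-2 and frame 5 are reported as forged.
print(labels_to_intervals([0, 1, 1, 0, 0, 1], frame_shift_s=1.0))
# [(1.0, 3.0), (5.0, 6.0)]
```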

[0125] Step 5: Optimization based on boundary-aware hybrid loss.

[0126] In order to overcome the problem that conventional cross-entropy loss does not pay enough attention to weak acoustic traces (such as phase discontinuity and small noise floor abrupt changes) at the splicing boundary during the model training phase, this embodiment designs a boundary-aware hybrid loss function.

[0127] 5.1 Boundary Weight Calculation: The system first determines the start and end time points of all forged segments from the ground-truth labels in the training data, denoted as the boundary set $\{b_k\}$. For each time frame t in the input sequence, it calculates the distance to the nearest boundary and assigns a Gaussian-decayed weight $w_t$:

[0128] Here, α (e.g., 1.5) determines how strongly the boundary region is weighted relative to the non-boundary region, and σ (e.g., 3 frames) determines the width of the attention window. As shown in Figure 3, the weight curve peaks at the actual tampering boundary, forcing the model to focus on learning the boundary features.
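The boundary weighting can be sketched as follows, using the example values α = 1.5 and σ = 3 frames. The baseline weight of 1 away from boundaries and the convention that a boundary sits on the first frame of each new label run are assumptions made for this illustration; the function names are hypothetical.

```python
import math


def boundaries_from_labels(labels):
    """Frame indices t at which the label differs from the previous frame."""
    return [t for t in range(1, len(labels)) if labels[t] != labels[t - 1]]


def boundary_weights(num_frames, boundaries, alpha=1.5, sigma=3.0):
    """Assumed form: w_t = 1 + alpha * exp(-d_t^2 / (2 sigma^2)), d_t = distance to nearest boundary."""
    if not boundaries:
        return [1.0] * num_frames  # fully real (or fully fake) clip: uniform weights
    weights = []
    for t in range(num_frames):
        d = min(abs(t - b) for b in boundaries)
        weights.append(1.0 + alpha * math.exp(-(d * d) / (2.0 * sigma * sigma)))
    return weights


# Toy ground truth: 10 real frames, 5 fake frames, 10 real frames.
y_true = [0] * 10 + [1] * 5 + [0] * 10
b = boundaries_from_labels(y_true)
w = boundary_weights(len(y_true), b)
print(b)                  # [10, 15]
print(round(w[10], 3))    # 2.5 at a boundary frame (1 + alpha)
```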

[0129] 5.2 Hybrid Loss Calculation and Backpropagation: The boundary-aware hybrid loss function $L_{total}$ combines the sequence-level loss of the CRF with a weighted frame-level loss:

[0130] Here, $L_{CRF} = -\log p(Y_{true} \mid X_{spec})$ is the negative log-likelihood of the true label sequence, used to optimize the sequence structure; $L_{BCE\_Weighted}$ is the cross-entropy loss weighted by $w_t$, used to enhance the discriminative power of local features; and the vertical bar in the formula denotes conditional probability. By minimizing this total loss, the system simultaneously achieves accurate frame-level classification and reasonable sequence structure prediction.
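The two loss terms can be sketched in plain Python as below. The CRF negative log-likelihood is computed with the standard forward recursion in log space. How the terms are combined is an assumption of this sketch: λ is applied to the weighted cross-entropy term and the scaling factor defaults to the number of frames; all function names are hypothetical.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def crf_nll(emissions, transitions, labels):
    """-log p(labels | X): log-partition via forward recursion minus the path score."""
    num_labels = len(emissions[0])
    score = sum(emissions[t][y] for t, y in enumerate(labels))
    score += sum(transitions[a][b] for a, b in zip(labels, labels[1:]))
    alpha = list(emissions[0])  # alpha[j]: log-sum of scores of all prefixes ending in j
    for row in emissions[1:]:
        alpha = [
            row[j] + log_sum_exp([alpha[i] + transitions[i][j] for i in range(num_labels)])
            for j in range(num_labels)
        ]
    return log_sum_exp(alpha) - score


def weighted_ce(emissions, labels, weights, scale):
    """Boundary-weighted frame-level cross-entropy, divided by `scale` (assumed placement)."""
    total = 0.0
    for row, y, w in zip(emissions, labels, weights):
        total += w * (log_sum_exp(row) - row[y])  # -log softmax(row)[y]
    return total / scale


def hybrid_loss(emissions, transitions, labels, weights, lam=1.0, scale=None):
    """Assumed combination: L_CRF + lam * L_BCE_Weighted."""
    scale = len(labels) if scale is None else scale
    return crf_nll(emissions, transitions, labels) + lam * weighted_ce(
        emissions, labels, weights, scale
    )


# Toy check with all-zero scores: every one of the 4 sequences is equally likely.
S0 = [[0.0, 0.0], [0.0, 0.0]]
A0 = [[0.0, 0.0], [0.0, 0.0]]
print(hybrid_loss(S0, A0, [0, 1], weights=[1.0, 2.0], lam=2.0))
```

In a real training loop these quantities would be computed on tensors with automatic differentiation; the list-based version only makes the arithmetic explicit.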

[0131] Figure 4 is a visual illustration of the result of applying the method of this invention to locate the tampering in a specific local content tampering sample. Panels A, B, and C respectively show the original audio waveform, the actual tampering annotation, and the model prediction output by this invention. The red shading represents the actual tampering interval, and the blue shading represents the model prediction interval. Panels B and C show a high degree of consistency between the actual tampering interval and the model prediction interval, indicating that the method described in this invention is accurate and effective.

[0132] Example 2

[0133] This embodiment provides a highly covert audio forgery localization system for implementing the above method, the system including the following modules.

[0134] Data Construction Module: This module is responsible for receiving raw speech corpora, calling large language models to perform local content manipulation to generate adversarial text, scheduling the zero-shot TTS model pool to synthesize fake speech, and using forced alignment and smooth splicing techniques to generate a dataset of fake audio with precise annotations.

[0135] Acoustic Feature Extraction Module: This module has a built-in pre-trained EAT model, which is used to convert the input audio waveform to be detected into a high-dimensional acoustic feature sequence that is sensitive to environmental noise and non-speech artifacts.

[0136] Temporal Dependency Modeling Module: This module contains stacked Conformer submodules and Bi-LSTM submodules, which are used to capture local convolutional features, global attention dependencies, and long-range temporal structures of feature sequences, respectively.

[0137] Structured Sequence Prediction Module: This module contains a linear projection layer and a CRF layer, which are used to calculate the emission scores and the transition scores, and it uses the Viterbi decoder to output the final forgery localization label sequence.

[0138] Model Optimization Module: This module is enabled only during the training phase. It is used to calculate the boundary-aware hybrid loss and update the trainable parameters of the above modules (except for the frozen EAT parameters) through the backpropagation algorithm.

[0139] Example 3

[0140] This embodiment is an application verification of the above embodiments. To verify the beneficial effects of the embodiments of the present invention, a comprehensive performance evaluation was performed on the constructed SegmentSpoof test set, and the method was compared with current mainstream audio forgery detection methods. Table 2 shows the quantitative comparison between the method of the present invention and baseline models such as RawNet2, AASIST, WavLM+Conformer+CRF, and CFPRF.

[0141] Table 2

[0142] This invention achieves a significant lead in the key metrics that measure the accuracy of tampered region localization. Specifically, the event-level F1 score (event-F1) reaches 0.972, a relative improvement of approximately 2.1% over the second-best model, CFPRF (0.952); and the more stringent fragment-level F1 score (fragment-F1) reaches 0.957, significantly outperforming improved versions of traditional binary classification models (RawNet2 and AASIST score only 0.341 and 0.381, respectively). This demonstrates the superior performance of the hierarchical sequence labeling architecture in capturing the boundaries of local content tampering.

[0143] In terms of the Equal Error Rate (EER), a metric that measures the ability to classify audio segments as genuine or fake, this invention achieves the lowest error rate, 1.99%, which is better than CFPRF's 2.09% and the WavLM baseline's 6.23%. This means that this invention can not only accurately locate the tampering position but also most reliably determine whether the entire audio segment has been tampered with.

[0144] Comparative data shows that baseline models without boundary-aware loss and CRF layers (such as WavLM+Conformer+CRF) significantly lag behind the present invention in fragment-F1 (0.883) and frame-F1 (0.876) metrics, further verifying the effectiveness of the boundary-aware hybrid loss function proposed in this invention in enhancing the ability to identify splicing boundaries.
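For reference, a frame-level F1 score can be computed as sketched below. This text does not define its frame-F1, event-F1, or fragment-F1 metrics, so the sketch assumes the conventional definition with FAKE (label 1) as the positive class; the function name is hypothetical.

```python
def frame_f1(reference, predicted, positive=1):
    """Harmonic mean of frame-level precision and recall for the positive class."""
    if len(reference) != len(predicted):
        raise ValueError("label sequences must have the same length")
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == positive and p == positive)
    fp = sum(1 for r, p in pairs if r != positive and p == positive)
    fn = sum(1 for r, p in pairs if r == positive and p != positive)
    if tp == 0:
        return 0.0  # no correctly detected positive frames
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# One forged frame is missed: precision 1.0, recall 2/3.
print(frame_f1([0, 1, 1, 1, 0], [0, 1, 1, 0, 0]))
```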

[0145] The positioning method and system provided by this invention have been described in detail above. Specific examples have been used to illustrate the principles and implementation methods of this invention. The descriptions of the embodiments above are merely for the purpose of helping to understand the core ideas of this invention. It should be noted that those skilled in the art can make various improvements and modifications to this invention without departing from its principles, and these improvements and modifications also fall within the protection scope of the claims of this invention.

[0146] The above description is merely a preferred embodiment of the present invention and is not intended to limit the invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A highly covert audio forgery localization method based on hierarchical sequence labeling, characterized in that it includes the following steps: Step S1, Acoustic Feature Extraction: receive the input audio waveform, convert it into a Mel spectrogram, and extract a high-dimensional acoustic feature sequence using a pre-trained upstream feature extractor, the upstream feature extractor being a self-supervised pre-trained Efficient Audio Transformer model, i.e., the EAT model; Step S2, Temporal Dependency Modeling: input the high-dimensional acoustic feature sequence into a downstream temporal encoder, which includes a serially connected Conformer encoder and Bi-LSTM network, the Bi-LSTM network being a bidirectional long short-term memory network; the stacked Conformer modules and the Bi-LSTM network model the local correlation, global contextual dependency, and long-range temporal structure of the high-dimensional acoustic feature sequence, generating a context-aware hidden state sequence $H_{BiLSTM}$; Step S3, Structured Sequence Prediction: map $H_{BiLSTM}$ to emission scores and apply a structured prediction layer, which is a conditional random field (CRF) layer; the CRF layer combines the emission scores with the transition scores between labels and uses the Viterbi algorithm to decode the optimal forged-label prediction sequence.

2. The method according to claim 1, characterized in that step S1 specifically includes: converting the input audio waveform into a Mel spectrogram, inputting the Mel spectrogram into the EAT model, and extracting the high-dimensional acoustic feature sequence $H_{EAT}$; furthermore, the parameters of the EAT model are frozen during model training.

3. The method according to claim 1, characterized in that step S2 specifically includes: capturing global dependencies using the multi-head self-attention mechanism in the Conformer module, capturing local features using the convolution module, and outputting the intermediate features $H_{Conf}$; the intermediate features $H_{Conf}$ are input into the Bi-LSTM network, which generates $H_{BiLSTM}$ through forward and backward recurrent processing.

4. The method according to claim 1, characterized in that step S3 specifically includes: projecting $H_{BiLSTM}$ through a linear layer to generate the emission score matrix $S$:

$$S = \mathrm{Linear}(H_{BiLSTM}) \in \mathbb{R}^{T \times K}$$

where Linear represents a linear function, $\mathbb{R}$ represents the set of real numbers, T represents the total number of time frames, and K represents the number of label categories; constructing a scoring function score(X, Y) that includes the emission scores and the transition scores A between labels:

$$\mathrm{score}(X, Y) = \sum_{t=1}^{T} S_{t, y_t} + \sum_{t=1}^{T-1} A_{y_t, y_{t+1}}$$

where $\mathcal{Y}$ represents the set of all possible label sequences; $S_{t, y_t}$ is the emission score that the model assigns to candidate label $y_t$ at time frame t; and $A_{y_t, y_{t+1}}$ is the transition score from candidate label $y_t$ at time frame t to candidate label $y_{t+1}$ at time frame t+1; and, during the inference phase, using the Viterbi algorithm to solve for the label sequence $Y \in \mathcal{Y}$ that maximizes the scoring function, which is then used as the final localization result.

5. The method according to claim 4, characterized in that the method further includes a model training step following step S1, wherein the model training step is optimized using a boundary-aware hybrid loss function $L_{total}$; the boundary-aware hybrid loss function is defined as follows: where λ is the balance coefficient, $L_{CRF}$ is the negative log-likelihood loss of the CRF layer, $S_{factor}$ is the scaling factor, $w_t$ is the boundary-aware weight of time frame t, and $l_{ce}(S_t, y_t)$ is the cross-entropy loss for time frame t, in which $S_t$ is the emission score vector corresponding to time frame t and $y_t$ is the true label corresponding to time frame t; and $w_t$ is calculated from the distance between time frame t and the nearest real/fake boundary $b_k$: where α is the enhancement intensity, σ is the Gaussian kernel width, k is the boundary index variable, and $b_k$ refers to the k-th real/fake boundary.

6. The method according to claim 5, characterized in that, prior to step S1, the method includes a step of constructing highly concealed forged training data, specifically: using a large language model to modify the content of the transcribed text of real speech, with modification strategies including negation modification, entity replacement, quantity modification, and detail injection; using a zero-shot speech synthesis model pool containing autoregressive and non-autoregressive architectures to synthesize forged speech segments based on the modified text and the original voiceprint; and using forced alignment technology to obtain temporal boundaries and a smooth splicing algorithm to embed the forged segments into the original audio, generating training samples containing precise physical splicing forged boundaries.

7. A highly covert audio forgery localization system based on hierarchical sequence labeling, characterized in that it includes: an acoustic feature extraction module, used to perform step S1 as described in any one of claims 1 to 6, receiving an input audio waveform and converting it into a high-dimensional acoustic feature sequence using a pre-trained upstream feature extractor; a temporal dependency modeling module, used to perform step S2 as described in any one of claims 1 to 6, performing multi-scale modeling of the high-dimensional acoustic feature sequence through the Conformer modules and the Bi-LSTM network to generate a context-aware hidden state sequence; and a structured sequence prediction module, used to perform step S3 as described in any one of claims 1 to 6, using the CRF layer to decode the hidden state sequence with the Viterbi algorithm and outputting the optimal forged-label prediction sequence.

8. The system according to claim 7, characterized in that the system further includes a model optimization module for performing the model training step as described in claim 5 or 6, calculating the boundary-aware hybrid loss and updating the model parameters.

9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the steps of the method as described in any one of claims 1 to 6.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, it implements the steps of the method as described in any one of claims 1 to 6.