Adversarial speech sample generation method, apparatus, device, storage medium, and product
The conditional prediction probability of each word unit is obtained through the target speech recognition model, energy penalty weights are generated from these probabilities, and a perturbation is iteratively optimized under them. This produces adversarial speech examples that are well concealed and use little perturbation energy, addressing the insufficient concealment and wasted energy of existing techniques and helping to improve the security of speech recognition systems.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHINA UNITED NETWORK COMM GRP CO LTD
- Filing Date
- 2026-02-14
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies generate adversarial speech samples that are either not sufficiently covert or wasteful of resources, making it impossible to effectively test the security vulnerabilities of automatic speech recognition systems.
By obtaining the conditional prediction probability of word units through the target speech recognition model, generating energy penalty weights, constructing a loss function and iteratively optimizing it, speech adversarial examples that are difficult for the human ear to detect are generated, ensuring the recognition success rate of the speech recognition system.
The generated adversarial speech examples are significantly less perceptible to the human ear without affecting the recognition success rate, solving the problems of insufficient concealment and energy waste, and helping to improve the anti-attack capability of the speech recognition system.
Figure CN122245289A_ABST
Description
Technical Field
[0001] This application relates to the field of computer technology, and in particular to a method, apparatus, device, storage medium, and product for generating adversarial speech samples.
Background Technology
[0002] With the rapid development of artificial intelligence technology, automatic speech recognition systems have been widely applied in smart homes, in-vehicle navigation, mobile devices, and security monitoring. These applications rely on automatic speech recognition systems to accurately convert speech signals into text for human-computer interaction. However, the deep neural networks underlying automatic speech recognition systems are vulnerable to adversarial attacks. By superimposing tiny perturbations on the original audio, an attacker can generate covert adversarial examples that are difficult for the human ear to detect. Such examples can induce the automatic speech recognition system to recognize the audio as a malicious command, for example recognizing "turn off the lights" as "open the door", which poses a serious security risk. It is therefore necessary to generate adversarial examples in order to accurately test the security vulnerabilities of automatic speech recognition systems and to promote improvements in the anti-attack capabilities of speech recognition systems.
[0003] Currently, existing technologies can cause automatic speech recognition systems to misjudge by adding noise; or they can use psychoacoustic attack methods to exploit the masking effect of the human auditory system to limit adversarial perturbations to the time and frequency range that are inaudible to the human ear.
[0004] However, existing methods for generating adversarial speech examples generally suffer from insufficient concealment or wasted energy.
Summary of the Invention
[0005] This application provides a method, apparatus, device, storage medium, and product for generating adversarial speech samples, in order to address the issues of insufficient concealment or wasted energy in adversarial speech samples.
[0006] In a first aspect, embodiments of this application provide a method for generating adversarial speech examples, applied to an electronic device, comprising: acquiring target instruction text and a target speech recognition model; wherein the target instruction text is a malicious instruction recognized by a speech recognition system; the target speech recognition model is a speech recognition model used by the speech recognition system; initializing an audio signal; acquiring the conditional prediction probability of each word unit in the target instruction text through the target speech recognition model; generating energy penalty weights aligned with the time axis of the audio signal based on the conditional prediction probabilities of each word unit in the target instruction text; determining a weight vector based on the energy penalty weights of each word unit on the time axis; constructing a loss function for the audio signal and the weight vector, and generating adversarial speech examples through iterative optimization; wherein the loss function is used to guide the optimization process.
[0007] In one possible implementation, the conditional prediction probability of each word unit in the target instruction text is obtained through a target speech recognition model, including: for any word unit, predicting the current word unit based on the word units preceding the current word unit, and obtaining the conditional prediction probability of the current word unit.
[0008] In one possible implementation, generating energy penalty weights aligned with the time axis of the audio signal based on the conditional prediction probabilities of each word unit in the target instruction text includes: mapping each word unit in the target instruction text onto the time axis of the audio signal; and determining the energy penalty weight of each word unit on the time axis based on the conditional prediction probabilities of each word unit in the target instruction text.
[0009] In one possible implementation, the energy penalty weight of each word unit is determined on the time axis based on the conditional prediction probability of each word unit in the target instruction text, using the following formula:
[0010] $$V_i = \alpha \cdot P(w_i \mid H_i) + \beta$$
[0011] where $V_i$ represents the energy penalty weight of the i-th word unit in the target instruction text; $P(w_i \mid H_i)$ represents the conditional prediction probability of the i-th word unit in the target instruction text; $\alpha$ represents the probability weighting coefficient; $\beta$ represents the base offset; $w_i$ represents the i-th word unit in the target instruction text; and $H_i$ represents the word units preceding the i-th word unit in the target instruction text.
[0012] In one possible implementation, the weight vector is determined based on the energy penalty weight of each word unit on the time axis, using the following formula:
[0013] $$\lambda(t) = V_i, \quad t_i^{start} \le t \le t_i^{end}, \quad i = 1, 2, \ldots, N$$
[0014] where $\lambda(t)$ represents the weight vector at time t; $V_i$ represents the energy penalty weight of the i-th word unit in the target instruction text; $t_i^{start}$ represents the start time of the i-th word unit; $t_i^{end}$ represents the end time of the i-th word unit; and N represents the number of word units in the target instruction text.
[0015] In one possible implementation, the loss function for constructing the audio signal and weight vector is formulated as follows:
[0016] $$L = L_{ASR}\big(f(x + \delta),\, T\big) + L_{Energy}$$
[0017] $$L_{Energy} = \sum_{t=1}^{M} \lambda(t) \cdot \lVert \delta_t \rVert_2^2$$
[0018] where $L$ represents the value of the loss function; $L_{ASR}$ represents the recognition loss value; $f$ represents the recognition function; $x$ represents the audio signal; $\delta$ represents the perturbation signal superimposed on the audio signal; and $T$ represents the target instruction text. Further, $L_{Energy}$ represents the energy suppression loss value; $M$ represents the number of moments on the time axis; $t$ represents the t-th moment on the time axis; $\lambda(t)$ represents the weight vector at time t; $\delta_t$ represents the perturbation signal superimposed on the audio signal at time t; and $\lVert \cdot \rVert_2^2$ represents the square of the L2 norm.
[0019] In one possible implementation, speech adversarial examples are generated through iterative optimization, including: iteratively executing the following steps: updating the perturbation signal using a gradient descent algorithm; obtaining a new loss function value based on the updated perturbation signal; determining the audio signal and the updated perturbation signal as initial speech adversarial examples; inputting the initial speech adversarial examples into a target speech recognition model, causing the target speech recognition model to output the confidence level of recognizing the initial speech adversarial examples as target instruction text; obtaining the energy of the initial speech adversarial examples; where energy refers to acoustic intensity; if the difference between the new loss function value and the loss function value in the previous iteration is less than a preset loss function threshold, and the confidence level is greater than a preset confidence threshold, and the energy is lower than a preset energy threshold, then the iteration is stopped, and the audio signal and the latest perturbation signal are determined as speech adversarial examples.
[0020] In a second aspect, embodiments of this application provide a speech adversarial example generation apparatus, applied to an electronic device, comprising: a first acquisition module for acquiring target instruction text and a target speech recognition model; wherein the target instruction text is a malicious instruction recognized by a speech recognition system; an initialization module for initializing an audio signal; a second acquisition module for acquiring the conditional prediction probability of each word unit in the target instruction text through the target speech recognition model; a first generation module for generating energy penalty weights aligned with the time axis of the audio signal based on the conditional prediction probabilities of each word unit in the target instruction text; a determination module for determining a weight vector based on the energy penalty weights of each word unit on the time axis; and a second generation module for constructing a loss function of the audio signal and the weight vector, and generating speech adversarial examples through iterative optimization; wherein the loss function is used to guide the optimization process.
[0021] In a third aspect, embodiments of this application provide an electronic device, including: a memory and a processor;
[0022] The memory stores computer-executable instructions;
[0023] The processor executes the computer-executable instructions stored in the memory, causing the processor to perform the method of the first aspect and / or the various possible implementations of the first aspect as described above.
[0024] In a fourth aspect, embodiments of this application provide a computer-readable storage medium storing computer-executable instructions, which, when executed by a processor, are used to implement the method of the first aspect and / or the various possible implementations of the first aspect.
[0025] In a fifth aspect, embodiments of this application provide a computer program product, including a computer program that, when executed by a processor, implements the method of the first aspect and / or the various possible implementations of the first aspect.
[0026] The speech adversarial example generation method, apparatus, device, storage medium, and product provided in this application obtain the conditional prediction probability of each word unit in the target instruction text through a target speech recognition model. The conditional prediction probability indicates which word units are easy to recognize and which are difficult. Based on the conditional prediction probability, an energy penalty weight is generated for each word unit on the time axis of the audio signal. A high conditional prediction probability indicates that the speech recognition system can infer the word unit semantically without relying on high acoustic feature energy, i.e., perturbation energy; therefore, the corresponding energy penalty weight is large. A low conditional prediction probability indicates that the speech recognition system has difficulty predicting the next word unit from the previous word units and needs to retain a certain amount of acoustic feature energy for recognition; therefore, the corresponding energy penalty weight is small. The energy penalty weights are then allocated to the corresponding time points on the audio signal time axis through a weight vector. A loss function is constructed based on the audio signal and the weight vector, and adversarial speech examples are generated through iterative optimization. The loss function includes an energy suppression term associated with the weight vector, which forces a reduction in perturbation energy during periods with high energy penalty weights (i.e., word units with high conditional prediction probability). During periods with low energy penalty weights (i.e., word units with low conditional prediction probability), only the minimum energy required for recognition is retained. The resulting adversarial speech examples do not produce noticeable noise due to excessive energy, thus addressing the issues of insufficient concealment or energy waste in existing technologies.
Description of the Drawings
[0027] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application.
[0028] Figure 1 is a schematic diagram illustrating a scenario for the method of generating adversarial speech examples provided in an embodiment of this application;
[0029] Figure 2 is a flowchart illustrating the method for generating adversarial speech samples provided in an embodiment of this application;
[0030] Figure 3 is a schematic diagram illustrating the energy suppression principle based on conditional prediction probability provided in an embodiment of this application;
[0031] Figure 4 is a schematic diagram of the structure of the speech adversarial sample generation apparatus provided in an embodiment of this application;
[0032] Figure 5 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.
[0033] The accompanying drawings illustrate specific embodiments of this application, which will be described in more detail below. These drawings and descriptions are not intended to limit the scope of the concept in any way, but rather to illustrate the concept of this application to those skilled in the art through reference to particular embodiments.
Detailed Implementation
[0034] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.
[0035] Figure 1 is a schematic diagram of a scenario for the speech adversarial example generation method provided in the embodiments of this application. As shown in Figure 1, the scenario includes: a receiving device 101, a processing device 102, and a display device 103.
[0036] It is understood that the structures illustrated in the embodiments of this application do not constitute a specific limitation on the method for generating adversarial speech examples. In other feasible embodiments of this application, the above architecture may include more or fewer components than illustrated, or combine some components, or split some components, or arrange the components differently, which can be determined according to the actual application scenario and is not limited here. The components shown in Figure 1 can be implemented in hardware, software, or a combination of both.
[0037] In the specific implementation process, the receiving device 101 can be an input / output interface or a communication interface, and can acquire the target instruction text and the target speech recognition model.
[0038] The processing device 102 can initialize an audio signal; obtain the conditional prediction probability of each word unit in the target instruction text through a target speech recognition model; generate energy penalty weights aligned with the time axis of the audio signal based on the conditional prediction probability of each word unit in the target instruction text; determine the weight vector based on the energy penalty weights of each word unit on the time axis; construct a loss function for the audio signal and the weight vector, and generate adversarial speech examples through iterative optimization; wherein the loss function is used to guide the optimization process.
[0039] The display device 103 can be used to display the generated adversarial speech samples.
[0040] It should be understood that the aforementioned processor can be implemented by reading instructions from memory and executing those instructions, or it can be implemented through chip circuitry.
[0041] This application can be widely applied to scenarios that rely on voice recognition systems, such as smart homes, in-vehicle voice control systems, and security monitoring. For example, in smart homes, attackers may trigger operations such as turning on the air conditioner or unlocking the door using concealed voice commands; in in-vehicle systems, attackers may mislead navigation routes or control in-vehicle devices by disguising voice commands. This application generates adversarial voice samples that are difficult for the human ear to detect, accurately testing the security vulnerabilities of automatic voice recognition systems, while simultaneously promoting the upgrade of the anti-attack capabilities of voice recognition technology systems.
[0042] Furthermore, the network architecture and business scenarios described in the embodiments of this application are for the purpose of more clearly illustrating the technical solutions of the embodiments of this application, and do not constitute a limitation on the technical solutions provided in the embodiments of this application. As those skilled in the art will know, with the evolution of network architecture and the emergence of new business scenarios, the technical solutions provided in the embodiments of this application are also applicable to similar technical problems.
[0043] To address the aforementioned technical problems, this application proposes the following technical concept: The inventors analyzed the structural characteristics of speech recognition systems: the language model corrects the acoustic model output by predicting probabilities based on context, especially significantly reducing the dependence on acoustic features in semantically redundant regions. Based on this, the inventors propose that actively reducing the acoustic feature energy in regions with high conditional prediction probabilities of the target speech recognition model can significantly reduce human auditory perceptibility while maintaining the recognition success rate. Subsequently, the inventors obtain energy penalty weights that are positively correlated with conditional prediction probabilities, and then allocate the energy penalty weights to the corresponding moments on the audio signal time axis through weight vectors. Introducing weight vectors into the loss function guides the optimization process, causing the acoustic energy in high prediction probability regions to approach zero, while retaining necessary perturbations in low prediction probability regions. Finally, the adversarial speech examples generated through iterative optimization achieve a balance between "imperceptible to the human ear" and "recognizable by machines," solving the problems of insufficient concealment and energy waste in existing technologies.
[0044] The technical solution of this application and how the technical solution of this application solves the above-mentioned technical problems are described in detail below with specific embodiments. These specific embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments. The embodiments of this application will now be described with reference to the accompanying drawings.
[0045] Figure 2 is a flowchart illustrating the method for generating adversarial speech examples provided in an embodiment of this application. As shown in Figure 2, the method includes:
[0046] S201: Obtain the target instruction text and the target speech recognition model; wherein the target instruction text is a malicious instruction identified by the speech recognition system; and the target speech recognition model is the speech recognition model used by the speech recognition system.
[0047] In this embodiment, the target speech recognition model is a speech recognition system that includes an acoustic model and a language model.
[0048] In this embodiment, the target instruction text refers to a malicious instruction that the attacker wants the voice recognition system to recognize, such as opening a door.
[0049] In this embodiment, it is assumed that the target speech recognition model is fully accessible.
[0050] S202: Initialize an audio signal.
[0051] In this embodiment, an interference-free carrier audio signal is initialized as the base for the subsequently superimposed perturbation, ensuring that the initial audio signal itself does not compromise concealment.
[0052] Specifically, the audio signal length is set according to the typical pronunciation duration of the target instruction text; for example, "open door" takes approximately 2 seconds, i.e., L = 2 seconds. The sampling rate follows the input requirements of the target speech recognition model, for example 16000 Hz, meaning that 16000 data points are collected per second.
[0053] In this embodiment, one of the following initialization strategies may be selected. Strategy 1: an all-zero vector (silence), i.e., directly generating an array of zeros whose length equals the sampling rate multiplied by the signal duration (32000 samples for 2 seconds at 16000 Hz), corresponding to complete silence with no inherent acoustic features. Strategy 2: extremely low-amplitude Gaussian white noise, i.e., generating a random array with a mean of 0 and a very small variance, whose amplitude is below the human hearing threshold and which sounds close to silence.
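The two initialization strategies can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names, the standard deviation of 1e-4, and the fixed seed are assumptions chosen for the example and are not values given in this application.

```python
import random

SAMPLE_RATE = 16000   # Hz, from the example input requirement above
DURATION_S = 2.0      # seconds, typical duration of "open door"


def init_silence(sample_rate=SAMPLE_RATE, duration_s=DURATION_S):
    """Strategy 1: all-zero vector (complete silence)."""
    n_samples = round(sample_rate * duration_s)
    return [0.0] * n_samples


def init_low_noise(sample_rate=SAMPLE_RATE, duration_s=DURATION_S,
                   sigma=1e-4, seed=0):
    """Strategy 2: zero-mean Gaussian white noise with a very small spread.

    sigma is an assumed illustrative value, not one taken from the source.
    """
    rng = random.Random(seed)
    n_samples = round(sample_rate * duration_s)
    return [rng.gauss(0.0, sigma) for _ in range(n_samples)]


silence = init_silence()
noise = init_low_noise()
print(len(silence), len(noise))
```

Both functions return a signal whose length is the sampling rate multiplied by the duration, so the perturbation and the weight vector built later can share the same time axis.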
[0054] S203: Obtain the conditional prediction probability of each word unit in the target instruction text through the target speech recognition model.
[0055] Specifically, for any given word unit, based on the word units preceding the current word unit, a prediction is made for the current word unit to obtain the conditional prediction probability of the current word unit.
[0056] In this embodiment, the language model of the target speech recognition model is preferentially used to predict the current word unit and obtain the conditional prediction probability of the current word unit. If the language model cannot be accessed, an external pre-trained language model can be selected.
[0057] In this embodiment, the target instruction text is T = {w1, w2, …, wN}, where w1 represents the first word unit, w2 represents the second word unit, and wN represents the N-th word unit.
[0058] For example, for the i-th word unit w_i, the conditional prediction probability of w_i is predicted based on the word units preceding it, H_i = {w1, w2, …, w_(i-1)}. For instance, w1 = open is at the beginning of the sentence, so its predicted probability may be low; denote it P1. For w2 = door, the probability of "door" appearing after "open" is extremely high; denote it P2. In this case P2 is usually greater than P1.
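The prefix-conditioned prediction described above can be sketched with a toy lookup table standing in for the language model. The table `TOY_LM` and the function name are hypothetical; the two probabilities reuse the example values P1 = 0.08 and P2 = 0.93 that appear later in this description, and a real system would query the language model of the target speech recognition model instead.

```python
# Hypothetical toy language model: maps (history, word unit) to a probability.
# The empty tuple () stands for the start of the sentence.
TOY_LM = {
    ((), "open"): 0.08,          # P(open | sentence start)
    (("open",), "door"): 0.93,   # P(door | open)
}


def conditional_probabilities(word_units, lm_table):
    """Return P(w_i | H_i) for each word unit, where H_i is its full prefix."""
    probs = []
    for i, unit in enumerate(word_units):
        history = tuple(word_units[:i])   # H_i = {w1, ..., w_(i-1)}
        probs.append(lm_table[(history, unit)])
    return probs


probs = conditional_probabilities(["open", "door"], TOY_LM)
print(probs)
```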
[0059] S204: Generate energy penalty weights aligned with the time axis of the audio signal based on the conditional prediction probabilities of each word unit in the target instruction text.
[0060] Specifically, step S204 includes S2041~S2042:
[0061] S2041: Map each word unit in the target instruction text to the time axis of the audio signal.
[0062] Optionally, speech processing tools can be used to determine the moment corresponding to each word unit on the audio signal through forced alignment or attention mechanisms.
[0063] For example: the total length of the audio signal is 2 seconds; w1 = open corresponds to 0~0.8 seconds, and w2 = door corresponds to 0.8~2 seconds.
[0064] S2042: On the time axis, determine the energy penalty weight of each word unit based on the conditional prediction probability of each word unit in the target instruction text.
[0065] Specifically, based on the conditional prediction probability of each word unit in the target instruction text, the energy penalty weight of each word unit is set, and the formula is as follows:
[0066] $$V_i = \alpha \cdot P(w_i \mid H_i) + \beta$$
[0067] where $V_i$ represents the energy penalty weight of the i-th word unit in the target instruction text; $P(w_i \mid H_i)$ represents the conditional prediction probability of the i-th word unit in the target instruction text; $\alpha$ represents the probability weighting coefficient; $\beta$ represents the base offset; $w_i$ represents the i-th word unit in the target instruction text; and $H_i$ represents the word units preceding the i-th word unit in the target instruction text.
[0068] Here, conditional prediction probability refers to the probability that the language model predicts the next word based on historical information, such as P(door|open). Historical information refers to the prediction constraints of the word units already generated in the target instruction text on subsequent word units.
[0069] In this embodiment, the energy penalty weight is an audio energy suppression coefficient that is dynamically adjusted based on the conditional prediction probability and is positively correlated with it. For example, a high conditional prediction probability corresponds to a high energy penalty weight.
[0070] In this embodiment, a high conditional prediction probability indicates that the speech recognition system can infer the word unit semantically without relying on high acoustic feature energy, i.e., perturbation energy. Therefore, the corresponding energy penalty weight is also larger.
[0071] In this embodiment, a low conditional prediction probability indicates that the speech recognition system has difficulty predicting the next word unit based on the preceding word units, as with "open" at the beginning of a sentence. Such words are key information in the instruction, but the historical information lacks sufficient redundancy, so a certain amount of acoustic feature energy must be retained for recognition. Therefore, the corresponding energy penalty weight is relatively small.
[0072] In this embodiment, the probability weighting coefficient α and the base offset β are determined and adjusted experimentally; for example, α = 5 and β = 0.05.
[0073] For example, with α = 5 and β = 0.05, if w1 = open has conditional prediction probability P1 = 0.08 and w2 = door has conditional prediction probability P2 = 0.93, then V1 = 5 × 0.08 + 0.05 = 0.45 and V2 = 5 × 0.93 + 0.05 = 4.7.
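As a worked check of this step, the sketch below computes the energy penalty weights for the two example word units. It assumes a linear mapping (probability weighting coefficient times probability, plus base offset) with coefficient 5 and offset 0.05; these are the values that reproduce the weights 0.45 and 4.7 used in the weight vector example of this description, and they are an inference rather than values confirmed elsewhere. The function name is hypothetical.

```python
def energy_penalty_weight(prob, alpha=5.0, beta=0.05):
    """Assumed linear form: V_i = alpha * P(w_i | H_i) + beta.

    alpha (probability weighting coefficient) and beta (base offset) are
    illustrative values inferred from the worked example, not fixed constants.
    """
    if not 0.0 <= prob <= 1.0:
        raise ValueError("prob must be a probability in [0, 1]")
    return alpha * prob + beta


# P1 = 0.08 for "open", P2 = 0.93 for "door"
weights = [energy_penalty_weight(p) for p in (0.08, 0.93)]
print([round(w, 2) for w in weights])
```

Because the mapping is increasing in the probability, the highly predictable unit "door" receives a much larger penalty than the hard-to-predict unit "open".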
[0074] S205: Determine the weight vector based on the energy penalty weight of each word unit on the time axis.
[0075] Specifically, the formula for determining the weight vector based on the energy penalty weight of each word unit on the time axis is as follows:
[0076] $$\lambda(t) = V_i, \quad t_i^{start} \le t \le t_i^{end}, \quad i = 1, 2, \ldots, N$$
[0077] where $\lambda(t)$ represents the weight vector at time t; $V_i$ represents the energy penalty weight of the i-th word unit in the target instruction text; $t_i^{start}$ represents the start time of the i-th word unit; $t_i^{end}$ represents the end time of the i-th word unit; and N represents the number of word units in the target instruction text.
[0078] In this embodiment, a weight vector with the same length as the audio signal is constructed. For example, at a sampling rate of 16000 Hz, a 2-second audio clip corresponds to 32000 values of t. If time t falls within the time interval of word unit w_i, the weight vector at that time takes the energy penalty weight V_i corresponding to w_i.
[0079] For example: t = 0.5 seconds, which belongs to the interval w1, λ(0.5) = 0.45; t = 1.5 seconds, which belongs to the interval w2, λ(1.5) = 4.7.
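The construction of the weight vector can be sketched as a piecewise-constant fill over the sample axis. The function name and the `(start, end, weight)` tuple layout are assumptions for illustration; the sketch also assumes that a sample exactly on the boundary between two word units is assigned to the later unit.

```python
SAMPLE_RATE = 16000  # Hz


def build_weight_vector(segments, duration_s, sample_rate=SAMPLE_RATE):
    """Expand per-word-unit weights into one weight per audio sample.

    segments: list of (start_seconds, end_seconds, energy_penalty_weight).
    """
    n_samples = round(duration_s * sample_rate)
    lam = [0.0] * n_samples
    for start_s, end_s, weight in segments:
        start = round(start_s * sample_rate)
        end = min(round(end_s * sample_rate), n_samples)
        for t in range(start, end):   # half-open interval [start, end)
            lam[t] = weight
    return lam


# "open" spans 0~0.8 s with weight 0.45, "door" spans 0.8~2 s with weight 4.7
segments = [(0.0, 0.8, 0.45), (0.8, 2.0, 4.7)]
lam = build_weight_vector(segments, duration_s=2.0)
print(len(lam), lam[8000], lam[24000])   # samples at 0.5 s and 1.5 s
```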
[0080] S206: Construct a loss function for the audio signal and weight vector, and generate adversarial speech examples through iterative optimization; the loss function is used to guide the optimization process.
[0081] In this embodiment, the formula for the loss function that constructs the audio signal and weight vector is:
[0082] $$L = L_{ASR}\big(f(x + \delta),\, T\big) + L_{Energy}$$
[0083] $$L_{Energy} = \sum_{t=1}^{M} \lambda(t) \cdot \lVert \delta_t \rVert_2^2$$
[0084] where $L$ represents the value of the loss function; $L_{ASR}$ represents the recognition loss value; $f$ represents the recognition function; $x$ represents the audio signal; $\delta$ represents the perturbation signal superimposed on the audio signal; and $T$ represents the target instruction text.
[0085] Further, $L_{Energy}$ represents the energy suppression loss value; $M$ represents the number of moments on the time axis; $t$ represents the t-th moment on the time axis; $\lambda(t)$ represents the weight vector at time t; $\delta_t$ represents the perturbation signal superimposed on the audio signal at time t; and $\lVert \cdot \rVert_2^2$ represents the square of the L2 norm.
[0086] The perturbation signal superimposed on the audio signal is essentially a signal vector with the same length as the audio signal, corresponding to the instantaneous amplitude value at each moment on the time axis. It is a tiny acoustic perturbation specially designed to induce misjudgment by the speech recognition system. Its amplitude is extremely small, and its initial state is close to 0, such as 0~0.01. After being superimposed on the audio signal, it will not cause noticeable noise, but it can accurately change the acoustic characteristics of the audio, allowing the speech recognition system to recognize the audio signal and the perturbation signal superimposed on the audio signal as the target instruction text.
[0087] In this embodiment, the recognition loss L_ASR may use connectionist temporal classification loss (CTC loss) or a similar method to characterize the probability that the audio signal with the superimposed perturbation signal is recognized as the target instruction text by the target speech recognition model. It quantifies the difference between the recognition result and the target instruction text: the greater the difference, the higher the recognition loss value.
[0088] In this embodiment, the energy suppression loss L_Energy applies a high-intensity energy penalty in regions with high energy penalty weights, forcing the acoustic energy of the perturbation signal in those regions to approach zero, so that the perturbation is imperceptible to the human ear without affecting the recognition result of the speech recognition system. It is calculated by multiplying the squared L2 norm of the perturbation signal at each time t by the weight vector at that time and summing over all moments.
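The energy suppression term can be sketched as a weighted sum of squared perturbation samples. The function names are hypothetical, each moment is assumed to hold a single scalar sample (so the squared L2 norm reduces to the square of the sample), and the recognition loss is passed in as a plain number because computing CTC loss requires the actual model.

```python
def energy_suppression_loss(delta, lam):
    """Sum over all moments of lam[t] * delta[t]**2."""
    if len(delta) != len(lam):
        raise ValueError("delta and lam must have the same length")
    return sum(w * d * d for d, w in zip(delta, lam))


def total_loss(recognition_loss, delta, lam):
    """Total loss: recognition loss plus energy suppression loss."""
    return recognition_loss + energy_suppression_loss(delta, lam)


# Four moments: the first two carry a low weight, the last two a high weight.
lam = [0.45, 0.45, 4.7, 4.7]
low_region = energy_suppression_loss([0.1, 0.1, 0.0, 0.0], lam)
high_region = energy_suppression_loss([0.0, 0.0, 0.1, 0.1], lam)
print(low_region, high_region)
```

The same perturbation amplitude costs roughly ten times more in the high-weight region, which is what pushes the optimizer to place energy where the word unit is hard to predict.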
[0089] In this embodiment, the following steps are iteratively executed: The gradient descent algorithm is used to update the perturbation signal; a new loss function value is obtained based on the updated perturbation signal; the audio signal and the updated perturbation signal are determined as initial adversarial speech samples; the initial adversarial speech samples are input into the target speech recognition model, causing the target speech recognition model to output the confidence score of recognizing the initial adversarial speech samples as target command text; the energy of the initial adversarial speech samples is obtained; where energy refers to acoustic intensity. If the difference between the new loss function value and the loss function value in the previous iteration is less than a preset loss function threshold, and the confidence score is greater than a preset confidence threshold, and the energy is lower than a preset energy threshold, then the iteration stops, and the audio signal and the latest perturbation signal are determined as adversarial speech samples.
[0090] The energy threshold is a pre-set value based on the psychoacoustic threshold. The psychoacoustic threshold refers to the critical value of sound intensity that the human ear can perceive, determined based on characteristics of the human auditory system, such as masking effects and frequency sensitivity.
[0091] In this embodiment, the following operations are performed repeatedly until the condition for stopping the iteration is met:
[0092] a) Use the gradient descent algorithm to calculate the gradient of the loss function value with respect to the perturbation signal, and update the perturbation signal in the opposite direction of the gradient.
[0093] b) The audio signal and the updated perturbation signal are used as the initial adversarial speech sample. The initial adversarial speech sample is input into the target speech recognition model to obtain the confidence score of the recognized target instruction text, such as the probability value output by the model.
[0094] c) Calculate the energy of the initial adversarial speech sample.
[0095] d) Calculate the current loss function value and the difference between the current loss function value and the loss function value in the previous iteration.
[0096] In this embodiment, the condition for stopping iteration is:
[0097] Condition 1: The difference is less than the preset loss function threshold, which indicates that the optimization has stabilized.
[0098] Condition 2: The confidence score is greater than the preset confidence threshold, which means that the target speech recognition model can stably recognize the initial adversarial speech sample as the target instruction text.
[0099] Condition 3: The energy is lower than the preset energy threshold, which means it is below the level of energy that the human ear can perceive.
[0100] In this embodiment, iteration stops when the above three conditions are met. The latest perturbation signal and audio signal from the last iteration are identified as the speech adversarial sample.
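The loop in steps a) to d) and the three stopping conditions can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of this application: the recognition loss is replaced by a hypothetical quadratic stand-in (distance between the perturbation and a `required` perturbation), the confidence score is a stand-in `exp(-recognition loss)`, and all names, thresholds, and the learning rate are illustrative. A real system would obtain the recognition loss and confidence from the target speech recognition model.

```python
import math


def optimize_perturbation(audio, weights, required, lr=0.05, loss_tol=1e-6,
                          conf_threshold=0.9, energy_threshold=1.0,
                          max_iters=5000):
    """Toy version of the iterative loop; returns (perturbation, steps, converged)."""
    delta = [0.0] * len(audio)
    prev_loss = None
    for step in range(1, max_iters + 1):
        # a) Gradient of the toy loss w.r.t. the perturbation, then a descent step.
        grad = [2.0 * (d - r) + 2.0 * w * d
                for d, r, w in zip(delta, required, weights)]
        delta = [d - lr * g for d, g in zip(delta, grad)]

        # b) Candidate sample and a stand-in confidence score in (0, 1].
        sample = [x + d for x, d in zip(audio, delta)]
        rec_loss = sum((d - r) ** 2 for d, r in zip(delta, required))
        confidence = math.exp(-rec_loss)

        # c) Energy of the candidate sample (sum of squared amplitudes).
        energy = sum(s * s for s in sample)

        # d) Current loss and its change from the previous iteration.
        loss = rec_loss + sum(w * d * d for w, d in zip(weights, delta))
        if (prev_loss is not None
                and abs(prev_loss - loss) < loss_tol
                and confidence > conf_threshold
                and energy < energy_threshold):
            return delta, step, True
        prev_loss = loss
    return delta, max_iters, False


# Frame 0 has a small penalty weight, frame 1 a large one.
delta, steps, converged = optimize_perturbation(
    audio=[0.0, 0.0], weights=[0.1, 10.0], required=[0.5, 0.0])
```

In this toy setting the perturbation in the low-weight frame settles near `0.5 / 1.1`, while the high-weight frame stays at zero, mirroring the intended behavior of keeping energy only where it is needed.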
[0101] In this embodiment, the perturbation signal is iteratively adjusted using a dual-constraint loss function, so that the audio signal with the superimposed perturbation signal is recognized as the target instruction text by the speech recognition system while remaining imperceptible to the human ear.
[0102] For example, refer to Figure 3, which is a schematic diagram illustrating the energy suppression principle based on conditional prediction probability, provided by an embodiment of this application. As shown in Figure 3, the diagram is divided into three parts for the case where the generated adversarial speech sample targets the instruction "open door". From top to bottom, the parts show the conditional prediction probability, the weight vector, and the adversarial audio waveform at different times.
[0103] Referring to Part 1, the conditional prediction probability of each word unit is displayed with time as the horizontal axis. The word unit "open" is at the beginning of the sentence, and the preceding word unit is empty, so the conditional prediction probability is low, and acoustic features need to be relied upon in this case; the word unit "door" is preceded by the word unit "open", which conforms to language habits, so the conditional prediction probability is high.
[0104] Referring to Part 2, when the conditional prediction probability is low, acoustic features are relied upon, and a certain amount of acoustic feature energy needs to be retained for recognition. Therefore, the corresponding energy penalty weight is small, meaning the corresponding weight vector is small. When the conditional prediction probability is high, it relies on the preceding word units, and there is no need to rely on high acoustic feature energy. Therefore, the corresponding energy penalty weight is also large, meaning the corresponding weight vector is also large.
[0105] Referring to Part 3, the weight vector directly determines the degree of energy suppression. In the low-weight vector range, energy is not strongly suppressed, and the waveform amplitude is relatively obvious, retaining necessary perturbations for the acoustic model in the speech recognition system to recognize the signal. In the high-weight vector range, energy is strongly suppressed, the waveform amplitude approaches zero, almost becoming silent, and imperceptible to the human ear.
[0106] Therefore, the method of this application can adaptively adjust the energy suppression intensity based on the conditional prediction probability of each word unit. When the conditional prediction probability is high, recognition does not need to rely on high acoustic feature energy, so energy waste is avoided. When the conditional prediction probability is low, enough acoustic feature energy is retained to preserve the recognition rate of the speech recognition system while keeping the perturbation as concealed as possible.
[0107] In summary, the target speech recognition model is used to obtain the conditional prediction probability of each word unit in the target instruction text. This probability indicates which word units are easy to predict and which are difficult. Based on the conditional prediction probability, an energy penalty weight is generated for each word unit on the time axis of the audio signal. A high conditional prediction probability indicates that the speech recognition system can infer the word unit semantically without relying on high acoustic feature energy (perturbation energy), so the energy penalty weight is larger. Conversely, a low conditional prediction probability indicates that the speech recognition system struggles to predict the word unit from the preceding one, so some acoustic feature energy must be retained for recognition and the energy penalty weight is smaller. Finally, the energy penalty weights are allocated to the corresponding time points on the time axis of the audio signal through a weight vector. A loss function is constructed based on the audio signal and the weight vector, and adversarial speech examples are generated through iterative optimization. The loss function includes energy suppression logic associated with the weight vector: during periods with a high energy penalty weight (word units with high conditional prediction probability), the perturbation energy is forced down; during periods with a low energy penalty weight (word units with low conditional prediction probability), only the minimum energy required for recognition is retained. The resulting adversarial speech examples do not produce noticeable noise from excessive energy, which addresses the insufficient concealment and energy waste of existing technologies.
[0108] Building upon the above embodiments, this embodiment considers that some speech recognition systems operate in a streaming manner, making it impossible to obtain the alignment information of the entire text in advance. Streaming speech recognition systems operate by receiving audio and decoding and recognizing it in real time, making it impossible to obtain the complete audio in advance or know in advance which time segment each word unit corresponds to in the entire audio. Therefore, this embodiment proposes a dynamic masking scheme based on a sliding window.
[0109] Specifically, in the process of generating adversarial examples, the decoding process of the speech recognition system is simulated in real time. Whenever the first k word units are successfully predicted, the language model is immediately invoked to predict the probability of the (k+1)th word unit.
[0110] In this embodiment, a probability threshold is set, which can be adjusted according to the actual situation. If the probability of the (k+1)th word unit is greater than the probability threshold, the (k+1)th word unit is determined to be inferable from the preceding word units, and a strong suppression mode is activated: in the loss function, the L2 regularization penalty for the subsequent audio frames corresponding to this word unit is increased, forcing their energy toward zero to avoid wasting energy. If the probability of the (k+1)th word unit is less than the probability threshold, this word unit is determined to be a key information word unit, and a weak suppression mode is activated, allowing an adversarial perturbation of a certain magnitude to be generated to ensure the recognition rate.
[0111] Optionally, the probability threshold can be set to 0.8, or it can be adjusted according to the actual situation.
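The sliding-window decision described above can be sketched as follows. This is an illustrative sketch only: the callback `next_unit_probability`, the toy lookup table `TOY_LM`, and its probability values are hypothetical stand-ins for the language model query, and treating a probability exactly equal to the threshold as weak suppression is an assumption, since the text only covers the greater-than and less-than cases.

```python
def select_suppression_modes(word_units, next_unit_probability, threshold=0.8):
    """For each word unit, query P(unit | preceding units) and pick a mode."""
    modes = []
    for k, unit in enumerate(word_units):
        prefix = word_units[:k]  # the first k word units already decoded
        p = next_unit_probability(prefix, unit)
        # Strong suppression when the unit is predictable from its prefix.
        mode = "strong" if p > threshold else "weak"
        modes.append((unit, p, mode))
    return modes


# Hypothetical probabilities, for illustration only.
TOY_LM = {
    ((), "open"): 0.05,
    (("open",), "door"): 0.90,
}


def toy_next_unit_probability(prefix, unit):
    """Look up the toy probability; unknown contexts get probability 0."""
    return TOY_LM.get((tuple(prefix), unit), 0.0)


modes = select_suppression_modes(["open", "door"], toy_next_unit_probability)
```

With these toy values, "open" falls into the weak suppression mode and "door" into the strong suppression mode, matching the "open door" example of Figure 3.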
[0112] In this embodiment, the final generated adversarial speech sample exhibits intermittent silence, with extremely faint noise only appearing at key semantic inflection points, and almost completely silent for most of the remaining time, thereby greatly reducing the human ear's perception rate.
[0113] Figure 4 is a schematic diagram of the structure of the speech adversarial sample generation device provided in an embodiment of this application. As shown in Figure 4, the speech adversarial sample generation device provided in this embodiment includes: a first acquisition module 401, an initialization module 402, a second acquisition module 403, a first generation module 404, a determination module 405, and a second generation module 406.
[0114] The first acquisition module 401 is used to acquire the target instruction text and the target speech recognition model; wherein the target instruction text is a malicious instruction recognized by the speech recognition system.
[0115] Initialization module 402 is used to initialize an audio signal.
[0116] The second acquisition module 403 is used to acquire the conditional prediction probability of each word unit in the target instruction text through the target speech recognition model.
[0117] The first generation module 404 is used to generate energy penalty weights aligned with the time axis of the audio signal based on the conditional prediction probability of each word unit in the target instruction text.
[0118] The determination module 405 is used to determine the weight vector based on the energy penalty weight of each word unit on the time axis.
[0119] The second generation module 406 is used to construct a loss function for the audio signal and weight vector, and generate adversarial speech examples through iterative optimization; the loss function is used to guide the optimization process.
[0120] In one possible implementation, the second acquisition module 403 is specifically used to: for any word unit, based on the word units preceding the current word unit, predict the current word unit and obtain the conditional prediction probability of the current word unit.
[0121] In one possible implementation, the first generation module 404 is specifically used to: map each word unit in the target instruction text onto the time axis of the audio signal; and determine the energy penalty weight of each word unit on the time axis based on the conditional prediction probability of each word unit in the target instruction text.
[0122] In one possible implementation, the energy penalty weight of each word unit is determined on the time axis based on the conditional prediction probability of each word unit in the target instruction text, using the following formula:
[0123]
[0124] In this formula, the quantities are: the energy penalty weight of the i-th word unit in the target instruction text; the conditional prediction probability of the i-th word unit in the target instruction text; the probability weighting coefficient; the base offset; the i-th word unit in the target instruction text; and the word unit preceding the i-th word unit in the target instruction text.
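The formula itself is not reproduced in this text, so the sketch below assumes, purely for illustration, a linear form in which the energy penalty weight equals the probability weighting coefficient times the conditional prediction probability, plus the base offset. This form matches the listed quantities and the stated behavior (higher probability, larger weight), but it is an assumption rather than the confirmed formula; the names `alpha` and `beta` and their default values are hypothetical.

```python
def energy_penalty_weights(cond_probs, alpha=1.0, beta=0.1):
    """Map conditional prediction probabilities to penalty weights.

    Assumed linear form: weight_i = alpha * probability_i + beta.
    """
    for p in cond_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    return [alpha * p + beta for p in cond_probs]


# Illustrative probabilities for the word units "open" and "door".
weights = energy_penalty_weights([0.05, 0.9], alpha=10.0, beta=0.1)
```

Under this assumption, the base offset keeps a small nonzero penalty even for word units that are hard to predict, and the coefficient controls how strongly predictable word units are suppressed.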
[0125] In one possible implementation, the weight vector is determined based on the energy penalty weight of each word unit on the time axis using the following formula:
[0126] $W(t) = w_i, \quad t_i^{start} \le t \le t_i^{end}, \quad i = 1, \dots, N$
[0127] where $W(t)$ represents the weight vector at time t; $w_i$ represents the energy penalty weight of the i-th word unit in the target instruction text; $t_i^{start}$ represents the start time of the i-th word unit; $t_i^{end}$ represents the end time of the i-th word unit; and $N$ represents the number of word units in the target instruction text.
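Allocating each word unit's energy penalty weight to the time points between its start and end times can be sketched as follows. This is a minimal sketch under assumptions: time is discretized into integer frame indices, segment boundaries are inclusive, and frames not covered by any word unit receive a hypothetical `default` weight of 0, which the text does not specify.

```python
def build_weight_vector(segments, num_steps, default=0.0):
    """Expand per-word-unit weights into a per-frame weight vector.

    segments: list of (start, end, weight) with inclusive integer indices.
    """
    vector = [default] * num_steps
    for start, end, weight in segments:
        if not 0 <= start <= end < num_steps:
            raise ValueError(f"segment out of range: {(start, end)}")
        for t in range(start, end + 1):
            vector[t] = weight
    return vector


# "open" spans frames 0-2 (small weight), "door" spans frames 3-5 (large weight).
weight_vector = build_weight_vector([(0, 2, 0.6), (3, 5, 9.1)], num_steps=7)
```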
[0128] In one possible implementation, the loss function constructed for the audio signal and the weight vector is expressed by the following formulas:
[0129] $L = L_{ASR}(f(x + \delta), y) + L_{Energy}$
[0130] $L_{Energy} = \sum_{t=1}^{T} W(t) \cdot \|\delta_t\|_2^2$
[0131] where $L$ represents the loss function value; $L_{ASR}$ represents the recognition loss value; $f(\cdot)$ represents the recognition function; $x$ represents the audio signal; $\delta$ represents the perturbation signal superimposed on the audio signal; $y$ represents the target instruction text; $L_{Energy}$ represents the energy suppression loss value; $T$ represents the number of moments on the time axis; $t$ represents the t-th moment on the time axis; $W(t)$ represents the weight vector at time t; $\delta_t$ represents the perturbation signal superimposed on the audio signal at time t; and $\|\delta_t\|_2^2$ represents the square of the L2 norm of $\delta_t$.
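The energy suppression term, described earlier as the weight vector at each time multiplied by the squared L2 norm of the perturbation at that time, can be sketched as follows. This is an illustrative sketch only: each time step is assumed to hold a short frame of perturbation values, the recognition loss is passed in as an already computed number rather than obtained from a speech recognition model, and adding the two terms without a balancing coefficient is an assumption.

```python
def energy_suppression_loss(weight_vector, perturbation_frames):
    """Sum over time of weight(t) * squared L2 norm of the perturbation frame."""
    if len(weight_vector) != len(perturbation_frames):
        raise ValueError("weight vector and perturbation must have equal length")
    return sum(w * sum(v * v for v in frame)
               for w, frame in zip(weight_vector, perturbation_frames))


def total_loss(recognition_loss, weight_vector, perturbation_frames):
    """Recognition loss plus the weighted energy suppression loss."""
    return recognition_loss + energy_suppression_loss(
        weight_vector, perturbation_frames)


# Two time steps, each with a two-value perturbation frame.
energy = energy_suppression_loss([0.5, 2.0], [[1.0, 2.0], [0.5, 0.5]])
loss = total_loss(1.0, [0.5, 2.0], [[1.0, 2.0], [0.5, 0.5]])
```

Because the weight multiplies the perturbation energy per time step, the same amount of perturbation costs more in frames with a large weight, which is what pushes the optimizer to silence those frames first.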
[0132] In one possible implementation, the second generation module 406 is specifically used to iteratively execute the following steps: updating the perturbation signal using a gradient descent algorithm; obtaining a new loss function value based on the updated perturbation signal; determining the audio signal and the updated perturbation signal as initial adversarial speech samples; inputting the initial adversarial speech samples into the target speech recognition model, causing the target speech recognition model to output the confidence level of recognizing the initial adversarial speech samples as target instruction text; obtaining the energy of the initial adversarial speech samples; where energy refers to acoustic intensity; if the difference between the new loss function value and the loss function value in the previous iteration is less than a preset loss function threshold, and the confidence level is greater than a preset confidence threshold, and the energy is lower than a preset energy threshold, then the iteration stops, and the audio signal and the latest perturbation signal are determined as adversarial speech samples.
[0133] The speech adversarial sample generation device provided in this embodiment can execute the method provided in the above method embodiment. Its implementation principle and technical effect are similar, and will not be described in detail here.
[0134] Figure 5 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. As shown in Figure 5, the electronic device provided in this embodiment includes at least one processor 501 and a memory 502. Optionally, the electronic device further includes a communication component 503. The processor 501, memory 502, and communication component 503 are connected via a bus.
[0135] In a specific implementation, at least one processor 501 executes computer execution instructions stored in memory 502, causing at least one processor 501 to perform the above-described method.
[0136] The specific implementation process of processor 501 can be found in the above method embodiments, and its implementation principle and technical effect are similar. It will not be repeated here.
[0137] In the above embodiments, it should be understood that the processor can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), etc. The general-purpose processor can be a microprocessor or any conventional processor. The steps of the method disclosed in this invention can be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules within the processor.
[0138] The memory may include random access memory (RAM) and may also include non-volatile memory (NVM), such as at least one disk storage device.
[0139] The bus can be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, etc. Buses can be categorized as address buses, data buses, control buses, etc. For ease of illustration, the buses shown in the accompanying drawings are not limited to a single bus or a single type of bus.
[0140] This application also provides a computer program product, including a computer program that, when executed by a processor, implements the above-described method.
[0141] This application also provides a computer-readable storage medium storing computer-executable instructions, which, when executed by a processor, implement the above-described method.
[0142] The aforementioned readable storage medium can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk. The readable storage medium can be any available medium accessible to a general-purpose or special-purpose computer.
[0143] An exemplary readable storage medium is coupled to a processor, enabling the processor to read information from and write information to the readable storage medium. Of course, the readable storage medium can also be a component of the processor. The processor and the readable storage medium can reside in an Application Specific Integrated Circuit (ASIC). Alternatively, the processor and the readable storage medium can exist as discrete components in the device.
[0144] The division of units is merely a logical functional division; in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or other forms.
[0145] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0146] In addition, the functional units in the various embodiments of the present invention can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit.
[0147] If a function is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this invention, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of this invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0148] Those skilled in the art will understand that all or part of the steps of the above-described method embodiments can be implemented by hardware related to program instructions. The aforementioned program can be stored in a computer-readable storage medium. When executed, the program performs the steps of the above-described method embodiments; and the aforementioned storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks, or optical disks.
[0149] Finally, it should be noted that other embodiments of the invention will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention that follow the general principles of the invention and include common knowledge or customary techniques in the art not disclosed herein, and is not limited to the precise structures described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of the invention is limited only by the appended claims.
Claims
1. A method for generating adversarial speech examples, characterized in that it is applied to an electronic device and comprises: obtaining a target instruction text and a target speech recognition model, wherein the target instruction text is a malicious instruction recognized by a speech recognition system, and the target speech recognition model is the speech recognition model used by the speech recognition system; initializing an audio signal; obtaining, through the target speech recognition model, the conditional prediction probability of each word unit in the target instruction text; generating, based on the conditional prediction probability of each word unit in the target instruction text, energy penalty weights aligned with the time axis of the audio signal; determining a weight vector based on the energy penalty weight of each word unit on the time axis; and constructing a loss function for the audio signal and the weight vector, and generating adversarial speech examples through iterative optimization, wherein the loss function is used to guide the optimization process.
2. The method according to claim 1, characterized in that obtaining the conditional prediction probability of each word unit in the target instruction text through the target speech recognition model comprises: for any word unit, predicting the current word unit based on the word units preceding it, and obtaining the conditional prediction probability of the current word unit.
3. The method according to claim 1, characterized in that generating energy penalty weights aligned with the time axis of the audio signal based on the conditional prediction probability of each word unit in the target instruction text comprises: mapping each word unit in the target instruction text onto the time axis of the audio signal; and determining, on the time axis, the energy penalty weight of each word unit based on the conditional prediction probability of each word unit in the target instruction text.
4. The method according to claim 3, characterized in that the energy penalty weight of each word unit on the time axis is determined, based on the conditional prediction probability of each word unit in the target instruction text, by a formula whose quantities are: the energy penalty weight of the i-th word unit in the target instruction text; the conditional prediction probability of the i-th word unit in the target instruction text; the probability weighting coefficient; the base offset; the i-th word unit in the target instruction text; and the word unit preceding the i-th word unit in the target instruction text.
5. The method according to claim 1, characterized in that the weight vector is determined based on the energy penalty weight of each word unit on the time axis using the following formula: $W(t) = w_i, \quad t_i^{start} \le t \le t_i^{end}, \quad i = 1, \dots, N$, where $W(t)$ represents the weight vector at time t; $w_i$ represents the energy penalty weight of the i-th word unit in the target instruction text; $t_i^{start}$ represents the start time of the i-th word unit; $t_i^{end}$ represents the end time of the i-th word unit; and $N$ represents the number of word units in the target instruction text.
6. The method according to claim 1, characterized in that the loss function constructed for the audio signal and the weight vector is expressed as: $L = L_{ASR}(f(x + \delta), y) + L_{Energy}$, where $L$ represents the loss function value; $L_{ASR}$ represents the recognition loss value; $f(\cdot)$ represents the recognition function; $x$ represents the audio signal; $\delta$ represents the perturbation signal superimposed on the audio signal; and $y$ represents the target instruction text; and $L_{Energy} = \sum_{t=1}^{T} W(t) \cdot \|\delta_t\|_2^2$, where $L_{Energy}$ represents the energy suppression loss value; $T$ represents the number of moments on the time axis; $t$ represents the t-th moment on the time axis; $W(t)$ represents the weight vector at time t; $\delta_t$ represents the perturbation signal superimposed on the audio signal at time t; and $\|\delta_t\|_2^2$ represents the square of the L2 norm of $\delta_t$.
7. The method according to claim 6, characterized in that generating adversarial speech examples through iterative optimization comprises iteratively executing the following steps: updating the perturbation signal using a gradient descent algorithm; obtaining a new loss function value based on the updated perturbation signal; determining the audio signal and the updated perturbation signal as an initial adversarial speech sample; inputting the initial adversarial speech sample into the target speech recognition model, so that the target speech recognition model outputs the confidence level of recognizing the initial adversarial speech sample as the target instruction text; obtaining the energy of the initial adversarial speech sample, wherein the energy refers to acoustic intensity; and, if the difference between the new loss function value and the loss function value in the previous iteration is less than a preset loss function threshold, the confidence level is greater than a preset confidence threshold, and the energy is lower than a preset energy threshold, stopping the iteration and determining the audio signal and the latest perturbation signal as the adversarial speech sample.
8. A device for generating adversarial speech examples, characterized in that it is applied to an electronic device and comprises: a first acquisition module, used to acquire a target instruction text and a target speech recognition model, wherein the target instruction text is a malicious instruction recognized by a speech recognition system, and the target speech recognition model is the speech recognition model used by the speech recognition system; an initialization module, used to initialize an audio signal; a second acquisition module, used to acquire the conditional prediction probability of each word unit in the target instruction text through the target speech recognition model; a first generation module, used to generate energy penalty weights aligned with the time axis of the audio signal based on the conditional prediction probability of each word unit in the target instruction text; a determination module, used to determine a weight vector based on the energy penalty weight of each word unit on the time axis; and a second generation module, used to construct a loss function for the audio signal and the weight vector and to generate adversarial speech examples through iterative optimization, wherein the loss function is used to guide the optimization process.
9. An electronic device, characterized in that it comprises: a memory, used to store a computer program; and a processor, used to execute the computer program to implement the method according to any one of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-executable instructions which, when executed by a processor, implement the method according to any one of claims 1-7.
11. A computer program product, characterized in that it comprises a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.