Mechanical arm deception attack side channel detection method and system based on acoustic features

By using an acoustic feature-based detection method and reconstructing the motion information of the robotic arm using LSTM and convolutional layer networks, the problem of difficulty in detecting spoofing attacks in existing technologies is solved, and effective identification and detection of spoofing attacks on robotic arms is achieved.

CN118721277B · Active · Publication Date: 2026-06-26 · GUANGXI UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: GUANGXI UNIV
Filing Date: 2024-07-11
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Existing technologies struggle to effectively detect spoofing attacks on industrial robotic arms, especially those that evade detection by replaying normal motion data. Furthermore, traditional methods require invasive modifications to the robotic arm platform or are susceptible to instability in characteristic parameters.

Method used

A side-channel detection method for robotic arm spoofing attacks based on acoustic features is proposed. This method involves constructing a training recognition model, using LSTM and convolutional layer networks to process acoustic and motion data, and combining acoustic analysis and data-driven methods to reconstruct the motion information of the robotic arm. The reconstructed information is then compared with the motion information of the SCADA system to identify differences and trigger anomaly alarms.

Benefits of technology

It effectively identifies robotic arm deception attacks, avoids intrusive modifications to the robotic arm platform, and improves the accuracy and stability of detection.



Abstract

The application discloses an acoustic-feature-based side-channel detection method for deception attacks on mechanical arms, belonging to the technical field of deep learning. The method comprises the following steps: constructing a training recognition model; receiving the collected target audio, performing the sound processing step to extract features, and performing the training recognition step to output predicted motion data; and comparing the predicted motion data with the corresponding real-time motion data and evaluating whether the difference between each pair of parameters exceeds a preset threshold. The application uses sound separation technology to separate the mixed multi-axis sound into individual per-axis sounds, constructs an acoustic motion information recognition model for each motion joint through a collaborative physical and data-driven method, identifies the target sound according to the recognition model to obtain the predicted motion information of each axis, compares the predicted motion information with the collected real-time motion commands, and uses a threshold to determine whether a deception attack has occurred, thereby realizing side-channel detection of deception attacks on the mechanical arm.

Description

Technical Field

[0001] This invention relates to the field of deep learning technology, and in particular to a side-channel detection method for robotic arm deception attacks based on acoustic features.

Background Technology

[0002] Industrial robotic arms, due to their high strength, precision, and speed, are used in industrial production lines with highly repetitive processes and are widely used in aerospace, electronics, and automotive industries. With the vigorous advancement of Industry 4.0 and intelligent manufacturing, industrialization and informatization are becoming increasingly intertwined, and industrial production sites are being connected to the Industrial Internet. While this has enabled equipment interconnection and improved production efficiency, it has also exposed industrial control systems lacking adequate protection to the Industrial Internet, creating opportunities for attackers. As a crucial component of intelligent manufacturing networks, attacks on industrial robotic arms can have immeasurable consequences.

[0003] In fact, an increasing number of industrial robotic arms are becoming targets for cyber attackers, leading to serious security vulnerabilities. Among these, deception attacks are particularly insidious threats. This type of attack involves not only manipulating the robotic arm's operation but also replaying normal motion data to evade detection. For example, in the CORMAND2 incident, attackers not only injected malicious code into the robotic arm using sophisticated methods but also replayed seemingly normal motion data to the SCADA system. Such attacks significantly reduce productivity and harm operators and equipment. Adversaries aiming to disrupt manufacturing networks can launch deception attacks against robotic arms in the following ways: (i) by sending seemingly correct data to simulate normal operations to deceive the SCADA system, thus executing a man-in-the-middle attack; (ii) by uploading malicious operation code to the robotic arm to manipulate its operation.

[0004] To address this, numerous studies have proposed various anomaly detection methods, utilizing physical models or data-driven approaches. However, these methods primarily rely on data collected from SCADA systems, and a key limitation is their inability to detect spoofing attacks, because the information reported to SCADA typically appears unaltered. Additionally, some studies have attempted to combat such spoofing attacks using side-channel information for cross-validation, leveraging metrics such as power consumption and electromagnetic signals. However, these methods often require invasive modifications to the robotic arm platform or are susceptible to instability in their characteristic parameters.

Summary of the Invention

[0005] The purpose of this invention is to address the aforementioned problems by providing a side-channel detection method for robotic arm deception attacks based on acoustic features. This method, based on acoustic analysis and data-driven approaches, captures the unique relationship between the robotic arm's sound and its motion, reconstructs the robotic arm's motion from acoustic data, and then compares the reconstructed motion information obtained from the acoustic side channel with the motion information collected by SCADA. Any discrepancy exceeding a predefined threshold triggers an abnormal motion alarm, thereby effectively identifying potential intrusions.

[0006] To achieve the above objectives, the technical solution adopted by the present invention is as follows:

[0007] A side-channel detection method for robotic arm spoofing attacks based on acoustic features includes the following:

[0008] Step S1: Construct a training recognition model, including the following processing flow:

[0009] Sound processing steps: Noise reduction and sound separation are performed on the collected audio, and then features are extracted from the individual motion audio of each axis and concatenated into time-domain feature vectors and frequency-domain feature vectors;

[0010] Motion processing steps: Process the motion information collected for each axis, calculate the velocity, acceleration, and direction of motion for each axis, and form a motion information sequence;

[0011] Training and recognition steps: First, use an LSTM network to process the temporal feature vector input data, then use a convolutional layer network to process the frequency domain feature vector input data, then perform feature fusion, and then feed it into a classification network and a regression network. The classification network uses a fully connected layer and a softmax function to output orientation classification information, and the regression network uses a fully connected layer and a ReLU activation function to output velocity and acceleration prediction information, thus establishing a mapping between the sound feature vector and the motion information sequence.

[0012] Step S2: The training recognition model receives the collected target audio, performs sound processing operations to extract features, and performs training recognition operations to output the predicted motion data corresponding to the target audio.

[0013] Step S3: Compare the predicted motion data with the corresponding real-time motion data obtained, and evaluate whether the difference between each pair of parameters exceeds the preset threshold; if the difference is greater than the corresponding preset threshold, an intrusion alarm will be triggered.
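The motion processing step of S1 (velocity, acceleration, and direction for each axis) can be illustrated with a minimal sketch. It assumes that the motion information is a list of joint positions sampled at a fixed interval `dt` and that simple finite differences are acceptable; the function name `motion_sequence` and the sample values are hypothetical and do not come from the patent.

```python
def motion_sequence(positions, dt):
    """Build a (velocity, acceleration, direction) sequence for one axis.

    positions: joint positions sampled every dt seconds (assumed input format).
    direction: +1 forward, -1 backward, 0 stationary.
    """
    # First difference of position gives velocity
    velocity = [(b - a) / dt for a, b in zip(positions, positions[1:])]
    # First difference of velocity gives acceleration
    acceleration = [(b - a) / dt for a, b in zip(velocity, velocity[1:])]
    # Sign of the velocity gives the direction of motion
    direction = [(v > 0) - (v < 0) for v in velocity]
    # Trim to a common length so the three series stay aligned
    n = len(acceleration)
    return list(zip(velocity[:n], acceleration, direction[:n]))


sequence = motion_sequence([0.0, 1.0, 3.0, 3.0, 2.0], dt=1.0)
```

Each tuple in `sequence` is one entry of the motion information sequence that the recognition model would be trained to predict from the sound features.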

[0014] The noise reduction and separation process in step S1, the audio processing step, is as follows:

[0015] By statistically analyzing the noise generated by each axis in a quiet environment, the motion frequency range is identified, and a bandpass frequency filter is used to eliminate noise outside the motion frequency range.

[0016] A sparse representation algorithm is used to separate individual motion audio for each axis, which moves independently; the processing flow includes the following:

[0017] (1) Configure the normalized amplitude spectrum of the mixed signal for simultaneous multi-axis operation; the amplitude spectrum is represented as follows:

[0018] $X \approx \hat{X} = BW = \sum_{i \in A} w_i b_i$

[0019] Where X is the amplitude spectrum of the sound signal from simultaneous multi-axis operation; $\hat{X}$ is the approximate amplitude spectrum of the mixed sound signal; B represents the matrix consisting of the dictionaries for each axis; W represents the coefficient matrix; B is an overcomplete dictionary that includes all amplitude spectrum segments of the sound for each axis; A is the active set; $b_i$ is an atom of B; $w_i$ is the corresponding coefficient;

[0020] (2) The extracted normalized amplitude spectrum is clustered using the k-means clustering algorithm to construct an overcomplete dictionary for each axis; the overcomplete dictionary is represented as follows:

[0021] $B = (B_x\ B_y\ B_z\ B_u\ B_v\ B_w)^T$

[0022] $B_x$ is the overcomplete dictionary of axis x, $B_y$ is the overcomplete dictionary of axis y, $B_z$ is the overcomplete dictionary of axis z, $B_u$ is the overcomplete dictionary of axis u, $B_v$ is the overcomplete dictionary of axis v, and $B_w$ is the overcomplete dictionary of axis w;

[0023] (3) Solve for the sparse coefficients using the active set Newton algorithm; the specific processing is as follows:

[0024] The sparse coefficient $w_n$ is initialized by $w_n = \mathbf{1}^T X / \mathbf{1}^T b_n$ and updated by $w_A \leftarrow w_A - \alpha H_{w_A}^{-1} \nabla_{w_A}$; where $\mathbf{1}$ is an all-ones vector of the same length as X; $w_A$ is the vector of representation coefficients of the atoms in the active set A; α is the step size; $H_{w_A}^{-1} \nabla_{w_A}$ is the search direction; $H_{w_A}$ is the Hessian matrix with respect to $w_A$; $\nabla_{w_A}$ is the gradient of the KL divergence with respect to the active atom weight vector; before the inverse of the Hessian matrix is calculated, the identity matrix multiplied by a small constant $\epsilon = 10^{-10}$ is added to it; and the active set A is initialized with $A = \arg\min_n KL(X \| w_n b_n)$ and then updated iteratively, where $KL(X \| w_n b_n)$ represents the KL divergence between X and $w_n b_n$; through the above iterative algorithm, the active set A and the representation coefficients $w_A$ of the corresponding atoms in the active set A are obtained;

[0025] (4) Using the inverse Fourier transform, the approximate amplitude spectrum $X_i$ of each axis, and the original phase, the individual motion audio for each axis is reconstructed;

[0026] Here, the approximate amplitude spectrum $\hat{X}$ of the mixed sound signal is expressed as follows:

[0027]

[0028] Therefore, the approximate amplitude spectrum of each axis is expressed by the following formula:

[0029] $X_i = w_{iA} B_{iA},\ i \in \{x, y, z, u, v, w\}$

[0030] Here, i is not only the index of the axis, but also the index of the row of the complete dictionary B.
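Step (2) above, building a per-axis dictionary by clustering normalized amplitude spectra with k-means, can be sketched as follows. This is only an illustration under stated assumptions: each spectrum is normalized to sum to one, the centroids are initialized from the first k spectra to keep the example deterministic, and the names `normalize`, `kmeans`, and `axis_dictionary` are hypothetical.

```python
def normalize(spectrum):
    """Scale an amplitude spectrum so its bins sum to 1 (assumed normalization)."""
    total = sum(spectrum)
    return [v / total for v in spectrum] if total else list(spectrum)


def kmeans(points, k, iters=20):
    """Plain k-means; the returned centroids serve as dictionary atoms."""
    centroids = [list(p) for p in points[:k]]  # deterministic init (assumption)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each spectrum to its nearest centroid (squared Euclidean distance)
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def axis_dictionary(spectra, k):
    """Dictionary for one axis: k atoms learned from its single-axis spectra."""
    return kmeans([normalize(s) for s in spectra], k)


atoms_x = axis_dictionary([[4, 0], [0, 2], [3, 1], [1, 3]], k=2)
```

Stacking the dictionaries obtained this way for each axis would give the overall dictionary B; the number of atoms per axis is a tuning choice that the text does not specify.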

[0031] The feature extraction process for step S1, the sound processing step, is as follows:

[0032] In the time domain, the extracted features include frame energy, zero-crossing rate, temporal envelope, and energy entropy; in the frequency domain, spectral entropy and power-normalized cepstral coefficients are extracted.

[0033] For a given frame of length FL and an audio signal x(i), i = 1, 2, ..., FL,

[0034] Frame energy is obtained by the following formula:

[0035]

[0036] The zero-crossing rate is obtained as follows:

[0037]
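As a hedged illustration of the two features just named, the sketch below computes a frame energy and a zero-crossing rate for one frame. The exact normalization used in the patent is not recoverable from this text, so the definitions here (sum of squared samples, and sign changes divided by twice the frame length) are assumptions taken from common audio-analysis practice; the function names are hypothetical.

```python
def frame_energy(frame):
    """Sum of squared samples in the frame (normalization is an assumption)."""
    return sum(v * v for v in frame)


def zero_crossing_rate(frame):
    """Rate of sign changes between consecutive samples.

    Each sign flip contributes |sgn(b) - sgn(a)| = 2, hence the factor 2
    in the denominator (a common convention, assumed here).
    """
    def sign(v):
        return (v > 0) - (v < 0)

    flips = sum(abs(sign(b) - sign(a)) for a, b in zip(frame, frame[1:]))
    return flips / (2 * len(frame))


energy = frame_energy([1.0, -1.0, 1.0, -1.0])
zcr = zero_crossing_rate([1.0, -1.0, 1.0, -1.0])
```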

[0038] The signal is divided into frames, and the energy ratio of each frequency component within each frame is calculated. These energy ratios are then used to calculate the energy entropy of the sound. The frame is divided into sub-bands of length K; if $E_j$ is the energy of the j-th short frame, then the energy entropy is:

[0039]
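The energy entropy described above can be sketched as a Shannon entropy over the energy ratios of the short frames. The sketch assumes that the frame is split into `k` equal-length short frames and that the logarithm is base 2; neither detail is stated in the surviving text, and `energy_entropy` is a hypothetical name.

```python
import math


def energy_entropy(frame, k):
    """Entropy of the energy distribution over k equal-length short frames."""
    size = len(frame) // k
    # Energy of each short frame
    energies = [
        sum(v * v for v in frame[j * size:(j + 1) * size]) for j in range(k)
    ]
    total = sum(energies)
    if total == 0:
        return 0.0  # silent frame: define the entropy as zero
    ratios = [e / total for e in energies]
    # Shannon entropy; zero ratios contribute nothing
    return -sum(p * math.log2(p) for p in ratios if p > 0)


uniform = energy_entropy([1.0, -1.0, 1.0, -1.0], k=4)
concentrated = energy_entropy([1.0, 0.0, 0.0, 0.0], k=4)
```

Energy spread evenly over the short frames gives a high value, while energy concentrated in a single short frame gives a value near zero. The spectral entropy has the same form, with sub-band energies in place of short-frame energies.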

[0040] The temporal envelope of the sound is obtained in the following way:

[0041] $T_e(i) = \sqrt{\mathrm{Re}(HT(i))^2 + \mathrm{Im}(HT(i))^2}$

[0042] Where HT(i) is the analytic signal obtained by applying the Hilbert transform to the signal x(i), and the envelope $T_e(i)$ of the analytic signal is the square root of the sum of squares of its real part Re(HT(i)) and its imaginary part Im(HT(i));
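A minimal sketch of the temporal envelope follows. It builds the analytic signal in the frequency domain (keep the DC bin, double the positive-frequency bins, zero the negative-frequency bins) and then takes the magnitude, which equals the square root of the sum of squares of the real and imaginary parts. The naive O(n²) DFT is used only to keep the example free of third-party dependencies; `dft` and `temporal_envelope` are hypothetical names, and a real implementation would normally rely on an FFT library.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive discrete Fourier transform (O(n^2)), forward or inverse."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def temporal_envelope(x):
    """Magnitude of the analytic signal of x."""
    n = len(x)
    spectrum = dft(x)
    for k in range(n):
        if k == 0 or (n % 2 == 0 and k == n // 2):
            gain = 1  # DC and Nyquist bins are kept once
        elif k < (n + 1) // 2:
            gain = 2  # positive frequencies are doubled
        else:
            gain = 0  # negative frequencies are removed
        spectrum[k] *= gain
    analytic = dft(spectrum, inverse=True)
    return [math.sqrt(z.real ** 2 + z.imag ** 2) for z in analytic]


tone = [math.cos(2 * math.pi * 4 * t / 32) for t in range(32)]
envelope = temporal_envelope(tone)
```

For a pure tone of unit amplitude the envelope is flat at approximately 1, which is a convenient sanity check.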

[0043] $X_i(k)$, k = 1, ..., FL, represents the magnitude of the Fast Fourier Transform coefficients for a given frame;

[0044] The spectrum is divided into N subbands; if $E_f$ is the energy of the f-th subband, then the spectral entropy is:

[0045]

[0046] Based on this, the time-domain features and frequency-domain features are concatenated to form a time-domain feature vector $S_t$: {frame energy, zero-crossing rate, energy entropy, temporal envelope} and a frequency-domain feature vector $S_f$: {spectral entropy, power-normalized cepstral coefficients}.

[0047] In step S3, while the predicted motion data is obtained, the real-time motion data of the corresponding target is acquired from the SCADA system and processed to obtain the motion index parameters within the network. The difference between each pair of parameters is then evaluated using an EWMA control chart. If the difference is greater than the corresponding preset threshold, an intrusion alarm is triggered.
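The comparison in step S3 can be sketched with an exponentially weighted moving average (EWMA) of the difference between the predicted value and the value reported by SCADA. The smoothing factor `lam`, the threshold, and the sample values below are hypothetical; the patent text only states that an EWMA control chart and a preset threshold are used, so this is one plausible reading rather than the patented procedure.

```python
def ewma_alarms(predicted, reported, threshold, lam=0.3):
    """Return one alarm flag per sample for a single motion parameter.

    predicted: values reconstructed from sound (side channel).
    reported:  values reported by SCADA (main channel).
    lam:       EWMA smoothing factor in (0, 1] (assumed value).
    """
    z = 0.0
    alarms = []
    for p, r in zip(predicted, reported):
        # The EWMA statistic smooths the per-sample difference
        z = lam * (p - r) + (1 - lam) * z
        alarms.append(abs(z) > threshold)
    return alarms


alarms = ewma_alarms(
    predicted=[1.0, 1.0, 1.0, 1.0],
    reported=[1.0, 1.0, 0.0, 0.0],
    threshold=0.4,
)
```

Smoothing means that a single noisy sample does not raise an alarm, while a sustained mismatch between the two channels does after a few samples.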

[0048] As mentioned above, sound separation technology is used to separate mixed multi-axis sounds into different individual axis sounds. An acoustic motion information recognition model for each motion joint is constructed through a collaborative physics-based and data-driven approach. Based on this recognition model, the target sound is identified, and the predicted motion information for each axis is obtained. By comparing this predicted motion information with the collected real-time motion commands and combining it with a threshold, it is determined whether a spoofing attack has occurred, thus realizing the side-channel detection of spoofing attacks on the robotic arm.

[0049] Based on the aforementioned scheme, in an improved scheme, in order to enhance noise reduction, the noise reduction and separation process of the audio processing step S1 of the robotic arm deception attack side channel detection method also includes the following enhanced noise reduction step: by applying a short time window, the spectral characteristics of the background noise during the brief inactivity interval within the operation period are captured using fast Fourier transform, the noise spectrum is subtracted from the main audio signal, and then the processed signal is converted back to the time domain using inverse fast Fourier transform to complete the enhanced noise reduction.

[0050] Based on the aforementioned scheme, in an improved scheme, to address the differences in processing and transmission speeds between the main channel and the side channel and improve the accuracy of side channel intrusion detection, step S1 of the robotic arm deception attack side channel detection method further includes the following trace alignment step: using cross-correlation analysis to align the audio data with the motion data in terms of time traces. Specifically, as follows:

[0051] First, cross-correlation analysis is used to analyze the relationship between each group of audio and motion data samples in the training dataset, obtaining the locally optimal time offset for each group of samples. Then, the minimum and maximum time offsets for all samples are determined. Next, using the minimum and maximum time offsets as the boundaries of the search interval, a grid search method is used within this search interval to find the globally optimal time offset τ for the entire training dataset. τ is then used as the fixed offset for trace alignment. The cross-correlation coefficient and the optimal time offset are calculated as follows:

[0052] $R(\tau) = \sum_t x(t)\, m(t+\tau)$

[0053] $\tau_{optimal} = \arg\max_{\tau} R(\tau)$

[0054] R(τ) represents the cross-correlation coefficient between the sound time series x and the motion time series m measured at time offset τ; x(t) and m(t+τ) are the values of the two series at times t and (t+τ), respectively; by varying τ, the cross-correlation coefficient at different time offsets can be calculated, and the value $\tau_{optimal}$ that maximizes R(τ) represents the optimal time offset between the two sequences.
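The trace alignment step can be sketched as follows, assuming integer sample offsets and an unnormalized cross-correlation sum. The criterion used for the global grid search (the offset that maximizes the summed cross-correlation over all training samples) is an assumption, since the text does not state the objective; the function names are hypothetical.

```python
def cross_correlation(x, m, tau):
    """R(tau): sum of x(t) * m(t + tau) over the overlapping samples."""
    return sum(
        x[t] * m[t + tau] for t in range(len(x)) if 0 <= t + tau < len(m)
    )


def best_offset(x, m, tau_min, tau_max):
    """Locally optimal offset for one audio/motion sample pair."""
    return max(
        range(tau_min, tau_max + 1),
        key=lambda tau: cross_correlation(x, m, tau),
    )


def global_offset(pairs, tau_min, tau_max):
    """Grid search between the smallest and largest local offsets."""
    local = [best_offset(x, m, tau_min, tau_max) for x, m in pairs]
    lo, hi = min(local), max(local)
    # Assumed objective: total cross-correlation over the whole dataset
    return max(
        range(lo, hi + 1),
        key=lambda tau: sum(cross_correlation(x, m, tau) for x, m in pairs),
    )


audio = [0, 0, 1, 2, 1, 0, 0, 0]
motion = [0, 0, 0, 0, 1, 2, 1, 0]  # same pulse, delayed by two samples
offset = best_offset(audio, motion, -3, 3)
```

The offset returned by `global_offset` would then be applied as the fixed shift between the audio and motion traces before training and monitoring.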

[0055] By adopting the above technical solution, the present invention has the following beneficial effects:

[0056] This invention uses sound separation technology to separate mixed multi-axis sounds (joint motion audio) into different individual axis sounds (individual motion audio of each axis of the joint). An acoustic motion information recognition model for each joint is constructed through a collaborative physics-based and data-driven approach. The target sound is identified based on the recognition model, and the predicted motion information for each axis is obtained. By comparing the predicted motion information with the collected real-time motion commands and combining it with a threshold, it is determined whether the robot arm has been subjected to a spoofing attack, thus realizing side-channel detection of spoofing attacks.

Attached Figure Description

[0057] Figure 1 is a flowchart of Example 1 of the robotic arm deception attack side-channel detection method of the present invention.

[0058] Figure 2 is a flowchart of Example 2 of the robotic arm deception attack side-channel detection method of the present invention.

[0059] Figure 3 is a flowchart of Example 2 of the robotic arm deception attack side-channel detection method of the present invention.

[0060] Figure 4 is a diagram showing the differences in sound characteristics between two different axes of an N-axis industrial robotic arm.

[0061] Figure 5 is a diagram illustrating the performance of the ASIDS system against a super opponent, based on the present invention.

[0062] Figure 6 is a noise impact curve of the present invention.

Detailed Implementation

[0063] The specific implementation of the invention will be further described below with reference to the accompanying drawings.

[0064] Example 1

[0065] As shown in Figure 1 and Figure 2, the basic scheme of the acoustic feature-based robotic arm spoofing attack side-channel detection method of this application includes steps S1-S3. Furthermore, one preferred improved scheme further includes an enhanced noise reduction step, another preferred improved scheme further includes an alignment step, and yet another preferred improved scheme includes both the enhanced noise reduction step and the alignment step. The technical feature combinations of the basic scheme and each improved scheme are as described above. For the sake of simplicity, this explanation uses only the optimal combination of all the above schemes, applied to an industrial robotic arm, as an example. It should be emphasized that this method is not limited to the application described here, but can be applied to intrusion detection for any robotic arm and any axis motion control device, beyond industrial robotic arms.

[0066] The acoustic feature-based side-channel detection method for robotic arm spoofing attacks in Embodiment 1 aims to automatically detect spoofing attacks by reconstructing motion information through acoustic side channels. It is a solution based on the principle of separation. This method involves using sound separation technology to separate mixed multi-axis sounds (such as the audio of a joint's motion) into different individual axis sounds (such as the individual motion audio of each axis of the joint). Simultaneously, the overall motion of the robotic arm (such as the joint's motion) is decomposed by the control system into single-axis motion commands (such as the individual motion of each axis of the joint). Therefore, the complex motion of the entire robotic arm is conceptualized as a combination of multiple single-axis motions. Following this separation principle, multi-axis motion is decomposed into simpler single-axis motions. Subsequently, an acoustic motion information model of each joint is constructed using a collaborative physics-based and data-driven approach.

[0067] Essentially, during operation, the control cabinet converts codes into signals and sends them to each actuator. Subsequently, the control system breaks down the robot arm's task into specific commands for each axis, thereby decomposing the robot arm's overall motion data into corresponding actions for each axis.

[0068] To accurately construct the motion state of each axis, it is necessary to eliminate the overlapping interference of sounds generated by the motion of different axes. This means separating the individual sound of each axis from the composite noise generated when multiple axes move simultaneously. The feasibility of this sound separation stems from the different acoustic characteristics accompanying the motion of each axis. An N-axis industrial robotic arm has N servo motors, and therefore N degrees of freedom. The noise generated by each axis during its motion is unique, as shown in Figure 4. Figure 4 illustrates the time-domain waveforms (a and d), the power spectral density (PSD) diagrams (b and e), and the time-frequency spectrum diagrams (c and f) of the sound generated by two different axes during periodic reciprocating motion. Significant differences exist in the waveforms, frequency distributions, and energy distributions of the sound generated by the different axes, whether observed in the time domain, frequency domain, or time-frequency domain.

[0069] Robotic arm drive systems (motors and transmissions) produce different sounds when they vibrate during operation due to differences in their internal structures. Even if some axes use the same drive system, the captured sounds will still differ due to variations in factors such as the load on each axis and the angle and distance of the sound collector.

[0070] The rotational sound of the drive system powering each axis of a robotic arm not only provides unique information but also conveys details of the arm's motion, facilitating the construction of motion-sound correlations. Based on the code uploaded by engineers, each axis of the robotic arm moves according to parameters such as speed and direction expected by the control system, and the drive system for each axis provides the necessary power. When the robotic arm operates at a higher speed, the internal rotation of the drive system accelerates, causing a change in the emitted sound. Similarly, when the robotic arm moves forward or backward, factors such as asymmetry in axial loads result in noticeably different accompanying sounds. As shown in Figure 4(b) and Figure 4(e), the Z-axis exhibits significant differences in motion in the two directions, and energy fluctuations are greater when it moves in opposite directions. Other axes show similar patterns, with marked differences in characteristics such as energy and waveform curves during forward and backward motion. These characteristics help distinguish the direction and speed of motor motion. Therefore, similar movements of the robotic arm produce similar acoustic patterns, which can be parameterized and used for comparison to ensure the authenticity of the manufacturing process. This motivates generating corresponding acoustic motion reference information from the sound changes accompanying different movements of the robotic arm.

[0071] Figure 3 shows the workflow of this side-channel detection method for robotic arm spoofing attacks. During the training and monitoring phases, audio signals and kinematic data undergo preprocessing before application, including noise reduction and sound separation, kinematic parameter calculation, and time alignment. Then, a technique similar to template fingerprint-based pattern matching is used to identify spoofing attack activities. During the training phase, acoustic motion information (a motion fingerprint) for each axis is learned and generated by collecting data from a normal reference device. Subsequently, in the intrusion detection phase, the method continuously monitors the robotic arm's motion state by reconstructing the acoustic motion information and matching it with the acoustic information of each axis of the target device. When the difference between the reconstructed acoustic motion information and the motion information collected by SCADA exceeds a predefined threshold (indicating a significant deviation between the motion data collected by SCADA and the actual motion), an anomaly (intrusion) alarm is reported. This will be described in more detail below.

[0072] In the sound preprocessing section, various processing methods are applied to the raw sound data to obtain an accurate representation as input to the acoustic motion information generation model. This stage mainly involves noise reduction, sound separation, feature extraction, and downsampling to reduce the computational complexity of subsequent steps.

[0073] Noise reduction:

[0074] To effectively reduce the impact of background noise on the operating sound of robotic arms in industrial environments, this application employs a noise reduction strategy combining bandpass frequency filtering and spectral subtraction. Through detailed statistical analysis of the noise generated by each axis of the robotic arm in a quiet environment, the main frequency range of the robotic arm's movement was successfully identified. The implemented bandpass frequency filter effectively eliminates noise outside these defined frequency ranges.
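A hedged sketch of the band-pass idea: the signal is transformed to the frequency domain, every bin outside the identified motion band is set to zero, and the result is transformed back. Zeroing DFT bins is a stand-in chosen for brevity and is not necessarily the filter design used in the patent; the band limits, sample rate, and function names are hypothetical, and the naive DFT only keeps the example dependency-free.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive discrete Fourier transform (O(n^2)), forward or inverse."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def bandpass(x, fs, f_lo, f_hi):
    """Keep only components whose frequency lies in [f_lo, f_hi] Hz."""
    n = len(x)
    spectrum = dft(x)
    for k in range(n):
        # Bins above n/2 mirror the negative frequencies
        freq = min(k, n - k) * fs / n
        if not (f_lo <= freq <= f_hi):
            spectrum[k] = 0
    return [v.real for v in dft(spectrum, inverse=True)]


fs = 64
mixed = [
    math.sin(2 * math.pi * 4 * t / fs) + math.sin(2 * math.pi * 20 * t / fs)
    for t in range(fs)
]
filtered = bandpass(mixed, fs, f_lo=2, f_hi=8)
```

In this toy example the 20 Hz component plays the role of out-of-band noise and is removed, while the 4 Hz component passes through unchanged.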

[0075] Furthermore, to enhance the noise reduction effect, spectral subtraction was employed, utilizing background noise captured during brief inactivity intervals within the robotic arm's operation cycle. By applying a short time window, a Fast Fourier Transform (FFT) was used to accurately capture the spectral characteristics of the background noise during these pauses. Subsequently, this noise spectrum was subtracted from the main audio signal, effectively removing the noise component. Finally, an Inverse Fast Fourier Transform (IFFT) was used to transform the processed signal back to the time domain, completing the noise reduction process.
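The spectral subtraction described above can be sketched on a single frame: the magnitude spectrum of a noise-only segment is subtracted from the magnitude spectrum of the signal, negative results are clipped to zero, the original phase is kept, and the frame is transformed back. Processing one frame instead of a sequence of short windows, clipping at zero, and the function names are simplifying assumptions.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive discrete Fourier transform (O(n^2)), forward or inverse."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def spectral_subtract(frame, noise_frame):
    """Subtract the noise magnitude spectrum, keeping the signal phase."""
    spectrum = dft(frame)
    noise_mag = [abs(v) for v in dft(noise_frame)]
    cleaned = []
    for value, noise in zip(spectrum, noise_mag):
        magnitude = max(abs(value) - noise, 0.0)  # clip negative magnitudes
        cleaned.append(cmath.rect(magnitude, cmath.phase(value)))
    return [v.real for v in dft(cleaned, inverse=True)]


n = 64
motion = [math.sin(2 * math.pi * 3 * t / n) for t in range(n)]
hum = [0.5 * math.sin(2 * math.pi * 10 * t / n) for t in range(n)]
noisy = [a + b for a, b in zip(motion, hum)]
denoised = spectral_subtract(noisy, hum)
```

Here `hum` stands in for the background noise captured during an inactivity interval; in practice the noise estimate would come from a different time segment than the one being cleaned, so the cancellation would be approximate rather than exact.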

[0076] Sound separation:

[0077] To mitigate interference between the sounds produced by the motion of different axes, the denoised audio signal is first subjected to sound separation. A sparse representation algorithm is employed to separate the sound of each axis moving independently. Sparse representation has been widely used in signal processing to extract specific audio source signals from mixed audio environments. However, it should be noted that this application prioritizes better separation quality rather than sparsity, thus differing from traditional sparse representation objectives.

[0078] In the context of sparse representation, any signal can be represented as a linear combination of a small number of basis signals, representing most or all of the original signal. In other words, a mixed audio signal of axes can be approximated by a linear superposition of individual audio segments from independent motions of each axis. Since sound is not sparse, but its amplitude spectrum is sparse, we convert it to a normalized amplitude spectrum to fit the sparse representation model.

[0079] $X \approx \hat{X} = BW = \sum_{i \in A} w_i b_i$ (1)

[0080] In the formula, X is the amplitude spectrum of the sound signal from simultaneous multi-axis operation, and $\hat{X}$ is the approximate amplitude spectrum of the mixed sound signal. B represents the matrix consisting of the dictionaries for each axis, while W represents the coefficient matrix. B is typically an overcomplete dictionary that includes all amplitude spectrum segments of the sound for each axis. At any given moment when the robotic arm performs a specific motion, the sound can be efficiently represented by only a few segments from dictionary B, since most other dictionary atoms correspond to different specific motions. A is the active set, which includes only a few atoms selected from dictionary B that can approximately represent X. $b_i$ is an atom of B, and $w_i$ is the corresponding coefficient. Therefore, the focus here is on dictionary construction and the algorithmic process for solving the sparse representation.

[0081] For the construction of dictionary B, the goal of this application is to create an overcomplete dictionary, that is, one containing more basis vectors than the dimension of the signal they represent, thereby providing greater flexibility and robustness in capturing the fundamental features of sound signals through sparse representation. To obtain common acoustic features along a single axis, we use the k-means clustering algorithm to cluster the extracted normalized amplitude spectrum, thus constructing a high-quality overcomplete dictionary for each axis. This normalization helps to emphasize the relative frequency components within each time frame. The resulting overcomplete dictionary can be represented as follows:

[0082] $B = (B_x\ B_y\ B_z\ B_u\ B_v\ B_w)^T$ (2)

[0084] Here, B is an overcomplete dictionary, where the rows of the matrix represent different axes, and each row contains the normalized amplitude spectrum of the axis signal. $B_x$ is the overcomplete dictionary of axis x, and $B_y$, $B_z$, $B_u$, $B_v$, and $B_w$ are defined similarly. To solve for the sparse coefficients W, the active-set Newton algorithm (ASNA) is used. ASNA is an optimization method that uses the overcomplete dictionary constructed for each axis of the robotic arm to quickly solve for the active set and the non-negative coefficient matrix. The active set A can be initialized as shown in Equation 3 and updated as shown in Equation 4:

[0085] $A = \arg\min_n KL(X \| w_n b_n)$ (3)

[0086]

[0087] Here, $KL(X \| w_n b_n)$ represents the KL divergence between X and $w_n b_n$. The sparse coefficient $w_n$ can be initialized as shown in Equation 5 and updated as described in Equation 6:

[0088]

[0089] Here, $\mathbf{1}$ is an all-ones vector with the same length as X. $w_A$ is the vector of representation coefficients of the atoms in the active set A. α is the step size. $H_{w_A}^{-1} \nabla_{w_A}$ is the search direction, $H_{w_A}$ is the Hessian matrix with respect to $w_A$, and $\nabla_{w_A}$ is the gradient of the KL divergence with respect to the active atom weight vector. Before the inverse of the Hessian matrix is calculated, the identity matrix multiplied by a small constant $\epsilon = 10^{-10}$ is added to it to ensure the numerical stability of the inverse.
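The initialization of the active set (Equation 3) can be illustrated with a small sketch that tries every atom on its own and keeps the one with the smallest KL divergence to X. The single-atom weight `sum(x) / sum(b)` and the generalized KL divergence used here follow the standard formulation of the active-set Newton algorithm and are assumptions, because the corresponding equations did not survive in this text; the Newton update itself is not shown, and the function names are hypothetical.

```python
import math


def kl_divergence(x, y, eps=1e-12):
    """Generalized KL divergence between two non-negative spectra."""
    return sum(
        a * math.log((a + eps) / (b + eps)) - a + b for a, b in zip(x, y)
    )


def init_active_set(x, atoms):
    """Return (index, weight) of the single atom that best explains x."""
    best = None
    for n, atom in enumerate(atoms):
        # Weight that matches the total amplitude of x (assumed initialization)
        weight = sum(x) / sum(atom)
        divergence = kl_divergence(x, [weight * v for v in atom])
        if best is None or divergence < best[2]:
            best = (n, weight, divergence)
    return best[0], best[1]


atoms = [[0.5, 0.25, 0.25], [0.25, 0.25, 0.5]]
mixture = [1.0, 0.5, 0.5]
index, weight = init_active_set(mixture, atoms)
```

Starting from this one-atom active set, the full algorithm would alternate between adding atoms and taking regularized Newton steps on the active weights.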

[0090] Through the iterative algorithm described above, we obtain the active set A and the representation coefficients $w_A$ of the corresponding atoms in the active set A. The approximate amplitude spectrum $\hat{X}$ of the mixed sound signal can be represented by Equation 7, and the approximate amplitude spectrum of each axis can be represented by Equation 8.

[0091]

[0092] $X_i = w_{iA} B_{iA},\ i \in \{x, y, z, u, v, w\}$ (8)

[0093] In Equation 8, i is not only the index of the axis but also the index of the row in the traversed dictionary B. Finally, the inverse Fourier transform is applied to the approximate amplitude spectrum $X_i$ of each axis together with the original phase to reconstruct the pure motion audio of each axis.

[0094] Audio feature extraction:

[0095] To better construct the nonlinear relationship between the robotic arm's motion and its sound, commonly used features are extracted in the time and frequency domains to form an acoustic feature vector. In the time domain, the extracted features include frame energy, zero-crossing rate (ZCR), temporal envelope, and energy entropy. In the frequency domain, spectral entropy and power-normalized cepstral coefficients (PNCC) are extracted. The audio is divided into frames with a fixed frame size of 50 ms, and the features extracted from each frame form a feature vector for model training. For a given frame of length FL and audio signal x(i), i = 1, 2, ..., FL, the individual features are extracted as follows.

[0096] Changes in the speed and direction of the robotic arm's movement can affect the distribution of acoustic energy, making frame energy a potential feature. Frame energy can be obtained from the following formula:

[0097]

[0098] The zero-crossing rate (ZCR) is the number of times a signal waveform crosses zero within a specific time window, reflecting how often the amplitude changes sign over a short period. When the robotic arm is stationary, sign changes are more frequent because the recording is dominated by environmental noise, resulting in a higher ZCR. The zero-crossing rate is therefore a potential feature for describing the motion state and can be obtained as follows:

[0099]

[0100] The energy entropy of sound characterizes the uniformity of the energy distribution in an audio signal. It is typically calculated by dividing the signal into frames, calculating the energy ratio of each component within each frame, and then using these energy ratios to calculate the entropy value. High energy entropy indicates that the signal energy is distributed more uniformly across the components, while low energy entropy indicates that the signal energy is concentrated in specific components. Changes in the motion state of a robotic arm cause changes in sound energy, making energy entropy a potential feature for describing the motion state. Dividing a frame into short frames of length K, and letting $E_j$ be the energy of the j-th short frame, we have:

[0101]
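The three time-domain features described above (frame energy, zero-crossing rate, and energy entropy) can be sketched as follows. This is a minimal illustration in standard-library Python, not the formulas of this application, which are not reproduced in the text: it assumes the commonly used definitions (sum of squared samples, half the summed sign differences normalized by the number of adjacent sample pairs, and Shannon entropy in bits over short-frame energy ratios). The function names and the `sub_len` parameter are hypothetical.

```python
import math


def frame_energy(frame):
    """Sum of squared samples in the frame (assumed definition)."""
    return sum(s * s for s in frame)


def zero_crossing_rate(frame):
    """Half the summed sign differences, normalized by the number of adjacent pairs."""
    if len(frame) < 2:
        return 0.0

    def sign(v):
        return (v > 0) - (v < 0)

    changes = sum(abs(sign(b) - sign(a)) for a, b in zip(frame, frame[1:]))
    return changes / (2 * (len(frame) - 1))


def energy_entropy(frame, sub_len):
    """Shannon entropy (bits) of the energy ratios of short frames of length sub_len."""
    energies = [
        frame_energy(frame[i:i + sub_len])
        for i in range(0, len(frame) - sub_len + 1, sub_len)
    ]
    total = sum(energies)
    if total == 0:
        return 0.0
    ratios = [e / total for e in energies]
    return -sum(r * math.log2(r) for r in ratios if r > 0)
```

For example, a frame that alternates between +1 and -1 has the maximum zero-crossing rate of 1.0, and a frame whose energy is split evenly over two short frames has an energy entropy of 1 bit.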

[0102] The temporal envelope of sound describes the trend of signal amplitude change over time, reflecting the overall loudness change trend of the sound signal. It helps in analyzing and monitoring the state and movement of a robotic arm and can be obtained through the following methods:

[0103]

[0104] Here, HT(i) is the analytic signal obtained by applying the Hilbert transform to the signal x(i). The envelope E(i) of the analytic signal is the square root of the sum of the squares of its real part Re(HT(i)) and its imaginary part Im(HT(i)).

[0105] For frequency-domain data, let $X_i(k)$, k = 1, ..., FL, be the magnitudes of the Fast Fourier Transform (FFT) coefficients for a given frame. For spectral entropy, the spectrum is divided into N sub-bands. Let $E_f$ be the energy of the f-th sub-band; then:

[0106]
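The spectral entropy computation can be sketched as follows. This is an illustrative sketch under stated assumptions, since the formula of this application is not reproduced in the text: it uses a naive discrete Fourier transform in place of an FFT for self-containment, keeps only the non-negative-frequency coefficients, splits them into `n_bands` equal-width sub-bands, and takes the Shannon entropy in bits of the sub-band energy ratios. The function names and parameters are hypothetical.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitudes of the DFT coefficients at non-negative frequencies (naive O(n^2))."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def spectral_entropy(frame, n_bands):
    """Shannon entropy (bits) of the energy ratios of n_bands equal-width sub-bands."""
    mags = dft_magnitudes(frame)
    band_len = len(mags) // n_bands  # trailing coefficients that do not fit are ignored
    energies = [
        sum(m * m for m in mags[b * band_len:(b + 1) * band_len])
        for b in range(n_bands)
    ]
    total = sum(energies)
    if total == 0:
        return 0.0
    ratios = [e / total for e in energies]
    return -sum(r * math.log2(r) for r in ratios if r > 0)
```

A pure tone concentrates its energy in one sub-band and yields an entropy close to zero, while an impulse spreads energy evenly over all sub-bands and yields the maximum value of log2(N).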

[0107] Power normalized cepstral coefficients (PNCC) are a commonly used feature extraction technique in speech signal processing, an improved version of traditional Mel-frequency cepstral coefficients (MFCC). PNCC is primarily used to enhance the accuracy of speech recognition in noisy environments, more effectively capturing the essential features of sound. When analyzing the sound generated by robotic arm movements, PNCC also helps identify sound changes related to different motion states; therefore, PNCC is also extracted for subsequent use.

[0108] Finally, the time-domain features are concatenated into the time-domain feature vector $S_t$ = {Frame Energy, ZCR, Energy Entropy, Temporal Envelope}, and the frequency-domain features into the frequency-domain feature vector $S_f$ = {Spectral Entropy, PNCC}, in preparation for subsequent model training.

[0109] In the motion data processing section, the joint spatial trajectory data collected from the robotic arm includes sampling times and coordinates for different joints, but may lack the complete velocity, acceleration, and direction required to fully describe the joint spatial trajectory. As a remedy, angular velocity and acceleration are estimated using coordinate points and time. Angular velocity is calculated as the rate of change of angle over time, and angular acceleration is calculated as the rate of change of angular velocity over time, thus enabling the determination of these parameters for the joints at different time points. Direction is represented by 0 and 1, where 1 represents forward motion and 0 represents backward motion. This can be determined by the coordinate difference between closely adjacent time points. Ultimately, the motion information consists of the robotic arm's motion direction, velocity, and acceleration, which can be represented as M:{speed,acceleration,direction}.
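The estimation of velocity, acceleration, and direction from sampled joint coordinates can be sketched with simple finite differences. This is a minimal sketch, assuming forward differences between adjacent samples; the function name is hypothetical, and mapping a zero coordinate difference (no movement) to direction 0 is an assumption, since the text does not say how a stationary joint is labeled.

```python
def motion_parameters(times, angles):
    """Estimate angular velocity, angular acceleration, and direction per joint.

    Returns three lists: velocities (n - 1 values), accelerations (n - 2 values),
    and directions (n - 1 values; 1 = forward, 0 = backward or stationary).
    """
    velocities = []
    directions = []
    for k in range(len(angles) - 1):
        d_angle = angles[k + 1] - angles[k]
        d_time = times[k + 1] - times[k]
        velocities.append(d_angle / d_time)
        # Assumption: a zero difference is labeled 0 together with backward motion.
        directions.append(1 if d_angle > 0 else 0)

    accelerations = []
    for k in range(len(velocities) - 1):
        d_time = times[k + 2] - times[k + 1]
        accelerations.append((velocities[k + 1] - velocities[k]) / d_time)

    return velocities, accelerations, directions
```

For joint angles 0, 2, 6, 5 sampled at one-second intervals, this yields velocities 2, 4, -1, accelerations 2, -5, and directions 1, 1, 0.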

[0110] Trace alignment:

[0111] In the side-channel domain, due to numerous real-world factors, it is nearly impossible to perfectly collect aligned signal trajectories. This means that there is a clock offset between the two collected time series: the audio time series and the motion time series. Trajectory alignment can significantly improve the accuracy of subsequent side-channel analysis.

[0112] To improve detection accuracy, aligning audio data with robot motion data is crucial. This application employs cross-correlation analysis to align audio data with robot motion data. By analyzing the relationships between each sample in the dataset, we record the optimal time offset for each sample. It is important to note that the optimal time offset for each segment is not necessarily the optimal time offset for the entire dataset.

[0113] The training set contains many sets of signals. We first record the locally optimal time offset for each set of samples. Then, we determine the minimum and maximum time offsets for all samples. These minimum and maximum offsets are used as boundaries for the search interval. A grid search method is then used within this interval to find the globally optimal time offset τ, which maximizes the cross-correlation of the entire dataset. Finally, τ is used as a fixed offset for trajectory alignment.

[0114] Specifically, the cross-correlation coefficient R measures the similarity between two signals at different time offsets τ. It can be calculated as follows:

[0115]

[0116] $\tau_{optimal} = \arg\max_{\tau} R(\tau)$ (15)

[0117] Here, R(τ) is the cross-correlation between the sound time series x and the motion time series m at time offset τ, and x(t) and m(t+τ) are the values of the two sequences at times t and (t+τ), respectively. By varying τ, the cross-correlation coefficient at different time offsets can be calculated, and the τ that maximizes R(τ) is the optimal time offset between the two trajectories. Furthermore, since the sampling rates of the displacement data and the audio data differ, this application resamples the motion data to match the number of frames in the audio data. Linear interpolation is used to make the number of motion data points equal to the number of audio data frames, so that each frame is matched with the corresponding motion information.
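The alignment procedure can be sketched as follows: a local best offset is found for each sample pair, the minimum and maximum of these bound the search interval, and a grid search over that interval selects the offset that maximizes the cross-correlation summed over the whole dataset. This is a sketch under stated assumptions: the cross-correlation is taken as the unnormalized sum of x(t)·m(t+τ) over overlapping samples (the exact formula of this application is not reproduced in the text), offsets are integer sample counts, and the names `align_offset`, `max_lag`, and `resample_linear` are hypothetical.

```python
def cross_correlation(x, m, tau):
    """R(tau): sum of x[t] * m[t + tau] over the samples where both sequences overlap."""
    total = 0.0
    for t, value in enumerate(x):
        j = t + tau
        if 0 <= j < len(m):
            total += value * m[j]
    return total


def align_offset(pairs, max_lag):
    """Global offset maximizing the summed cross-correlation over all (x, m) pairs."""
    lags = range(-max_lag, max_lag + 1)
    # Locally optimal offset for each pair of sequences.
    local = [max(lags, key=lambda tau: cross_correlation(x, m, tau)) for x, m in pairs]
    # Grid search bounded by the smallest and largest local offsets.
    search = range(min(local), max(local) + 1)
    return max(search, key=lambda tau: sum(cross_correlation(x, m, tau) for x, m in pairs))


def resample_linear(values, n_out):
    """Linearly interpolate values onto n_out evenly spaced points."""
    if n_out == 1 or len(values) == 1:
        return [values[0]] * n_out
    scale = (len(values) - 1) / (n_out - 1)
    out = []
    for k in range(n_out):
        pos = k * scale
        lo = int(pos)
        hi = min(lo + 1, len(values) - 1)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out
```

In practice the sequences would first be brought to a common rate with `resample_linear`, for example by interpolating the motion data to one value per audio frame, before the offset search is run.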

[0118] In the training and recognition model section, a deep learning model is used to capture the motion sound dependence of the robotic arm. Specifically, the collected audio is first denoised and sound separated, and then features are extracted from the independent motion audio of each axis. This process results in the extraction of temporal and frequency domain feature vectors of the motion sound. Subsequently, the motion information collected from each axis is processed to calculate the velocity and acceleration of each axis, forming a motion information sequence. A two-stream model is then applied to map the two feature vectors to the motion information sequence. By combining the temporal and frequency domain features of the sound accompanying the robotic arm's motion (both of which are related to the arm's motion), the two-stream model is used to better construct the nonlinear relationship between arm motion and sound.

[0119] The core idea of the two-stream model is to use two independent network streams simultaneously to process different types of input data: a time-domain stream and a frequency-domain stream. The time-domain stream focuses on capturing dynamic behavior that changes over time, particularly the temporal features of the robotic arm's motion. In contrast, the frequency-domain stream focuses on analyzing the frequency distribution characteristics of the motion sound. The two networks independently learn features from different domains, enhancing the model's understanding and representation of the training data. Finally, the outputs of the two networks are fused through an attention mechanism, enabling the model to predict motion information such as the robotic arm's speed and direction more comprehensively.

[0120] Given that LSTM networks are well-suited for processing time-dependent data, temporal stream processing primarily utilizes LSTM. A sliding window method is employed to process the input data, creating a time window based on given acoustic features and velocity data. The input layer of temporal stream processing consists of temporal feature vectors composed of consecutive frames with a fixed window size. For each frame, frame energy, ZCR, energy entropy, and temporal envelope are extracted as features, which together constitute the input data. This is followed by multiple LSTM layers and fully connected layers, using ReLU as the activation function.
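The sliding-window construction of the time-domain input can be sketched as follows. This is an illustrative sketch only: it assumes that each window of consecutive per-frame feature vectors is paired with the motion value of the last frame in the window, which the text does not specify, and the names `sliding_windows`, `window`, and `step` are hypothetical.

```python
def sliding_windows(features, targets, window, step=1):
    """Pair each window of consecutive frame features with one motion target.

    features: list of per-frame feature vectors, for example
              [frame energy, ZCR, energy entropy, temporal envelope].
    targets:  list of per-frame motion values aligned with the features.
    Assumption: the target of a window is the motion value of its last frame.
    """
    inputs, outputs = [], []
    for start in range(0, len(features) - window + 1, step):
        inputs.append(features[start:start + window])
        outputs.append(targets[start + window - 1])
    return inputs, outputs
```

With five frames and a window of three frames, this produces three training examples whose targets are the motion values of frames three, four, and five.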

[0121] The frequency-domain stream primarily uses convolutional layers, which excel at capturing local patterns and frequency features within the frequency domain. As in the time-domain stream, a fixed window size is used: the input layer receives frequency-domain feature vectors in which spectral entropy and PNCC are extracted for each frame, and these elements together constitute the input data. This is followed by convolutional layers, normalization and pooling layers, and fully connected layers, using ReLU as the activation function.

[0122] After the time-domain and frequency-domain streams are processed by their respective dedicated networks, these two features are concatenated along the time axis and then fed into a network for a regression task to predict velocity and acceleration, and another network for a classification task to predict motion direction. Both networks first use an attention mechanism for feature fusion, allowing the model to learn to assign different weights to different features according to different tasks, optimizing overall performance and enhancing the model's understanding of complex data. The classification network uses fully connected layers and a softmax function to output direction classification, while the regression network uses fully connected layers and a ReLU activation function to output regression predictions of velocity and acceleration. The classification network uses cross-entropy loss as the loss function, and the regression network uses mean squared error (MSE) as the loss function.
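The output functions and losses named above can be written out in a few lines. This is a minimal sketch of the standard definitions of softmax, cross-entropy, ReLU, and mean squared error in plain Python; it does not reproduce the network architecture or the attention-based fusion, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Classification loss for direction: negative log-probability of the true class."""
    return -math.log(softmax(logits)[label])


def relu(z):
    """ReLU activation used on the regression output."""
    return max(0.0, z)


def mse(predicted, target):
    """Regression loss for velocity and acceleration: mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)
```

For the binary direction task, `label` is 0 (backward) or 1 (forward), and `logits` holds the two class scores produced by the classification network.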

[0123] In this way, the constructed model is used to estimate the observed motion information sequence, and the movement of the robotic arm is effectively identified by sound.

[0124] In the intrusion detection section, EWMA (Exponentially Weighted Moving Average) control charts are widely used to measure the weighted cumulative error between time series sequences. The final decision is made by using data provided by the SCADA system and estimates based on acoustic side channel data to determine whether an intrusion has occurred.

[0125] Algorithm 1, as described below, outlines the processing steps of the intrusion detection algorithm, which consists of two key stages: training and prediction. It begins with dual inputs of acoustic signals and kinematic data and aims to derive the intrusion indicator $F_D$.

[0126] Algorithm 1: Intrusion Detection Algorithm

[0127]

[0128] During the training phase, an independent dictionary is built for each axis based on the sound signal. The ASNA algorithm is then used for fine-grained sound separation, separating the mixed sound into the per-axis signals $x_i$. The main acoustic features are extracted according to calculation formulas (11)-(15): frame energy $E_f$, zero-crossing rate Z, temporal envelope $T_e$, energy entropy $E_e$, and spectral entropy $S_e$. The motion data sample m undergoes a parameter extraction process to generate estimates of velocity v, acceleration a, and direction d, which are needed for audio-motion alignment via cross-correlation, as shown in Equation (8). This alignment is essential for the subsequent training of the two-stream model.

[0129] During the prediction phase, the trained model is applied to the new robotic arm sound signal s' to extract features F' and infer the motion parameters $r_m$. Simultaneously, the real-time motion data $m_{scada}$ from the SCADA system is processed to obtain the motion indicators within the network: $v_{scada}$, $a_{scada}$, and $d_{scada}$. The EWMA detection phase is a key part of the algorithm: each pair of parameters $(r_m, m_{scada})$ is evaluated based on the EWMA control chart. If the difference $|E_{ewma} - m_{scada}|$ exceeds the threshold, an intrusion alarm is triggered, indicating a potential anomaly.
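The EWMA detection step can be sketched as follows. This is one possible reading, not the algorithm of this application as such: it assumes that the EWMA statistic is an exponentially weighted average of the acoustically predicted parameter, initialized with the first prediction, and that an alarm is raised at the first sample where its absolute difference from the SCADA value exceeds the threshold. The smoothing factor `lam`, its default value, and the function name are hypothetical.

```python
def ewma_detect(predicted, scada, threshold, lam=0.3):
    """Return the index of the first sample that triggers an alarm, or None.

    predicted: motion parameter reconstructed from sound (r_m).
    scada:     the same parameter as reported by the SCADA system (m_scada).
    """
    ewma = predicted[0]  # assumption: initialize with the first prediction
    for t, (r, m) in enumerate(zip(predicted, scada)):
        # Standard EWMA recursion: ewma = lam * r + (1 - lam) * ewma.
        ewma += lam * (r - ewma)
        if abs(ewma - m) > threshold:
            return t
    return None
```

As a usage example, if SCADA keeps reporting a constant speed of 1.0 (replayed data) while the acoustically reconstructed speed drops to 0.0 at sample 5, a threshold of 0.5 raises the alarm at sample 6, one sample after the deviation begins. The function would be called once per parameter (velocity, acceleration, direction) and per axis.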

[0130] Algorithm 1 comprehensively describes the process steps of reconstructing robotic arm motion information through acoustic side channels, thereby enabling cross-validation with data observed and monitored by the SCADA system to detect and indicate anomalies in spoofing attacks.

[0131] In the experimental and evaluation section, based on the aforementioned side-channel detection method for robotic arm deception attacks, we can construct a side-channel detection system for robotic arm deception attacks based on acoustic features, named ASIDS (acoustic side-channel intrusion detection systems). Based on this, we can verify the performance of the detection method and system in resisting deception attacks against industrial robotic arms through experimental setup and evaluation.

[0132] The experiment utilized a Borunte 1820A robotic arm. During the experiment, the recording equipment was positioned optimally to capture the best acoustic effects, recording the sounds accompanying the robotic arm's movements in WAV format at a sampling rate of 44.1 kHz. The corresponding displacement data was then obtained via SCADA. The experiment involved randomly playing noises recorded from industrial equipment (e.g., lathes and CNC machine tools) using two speakers. These noises were initially captured in an industrial environment to simulate the acoustic environment of an industrial setting. The data types involved in the experiment primarily included: basic motion data involving only a single axis; mixed multi-axis data; single-axis data separated after sound separation; and simulated attack data.

[0133] For the first three categories of data, stratified sampling was used, with 80% of the data designated for training and the remainder for testing. The first three categories were normal test data, and the fourth category was simulated intrusion test data. Mixed multi-joint motion data was synthesized through random combinations of individual joint motion data. Basic motions form the foundation of the industrial robotic arm's movement, and their combinations can cover all potential motions of the arm. Specifically, during the training phase, control code was written to make the robotic arm move in all directions across the entire range of motion of each axis, and mixed motion data of different numbers of axes, after sound separation, were also used as training data for each axis to ensure comprehensive feature learning. Cross-validation was used during training to enhance the robustness and generalization of the model. Using these stratified and cross-validated training data, model functions of the control parameters v, a, and d corresponding to the sound of each axis's motion were estimated.

[0134] Motion direction classification was formulated as a binary task. With accuracy as the primary metric, the analysis highlights the effectiveness of deep learning models, particularly the proposed model, which consistently achieves accuracy above 95% across all axes, as shown in Table 1. This performance reflects not only the model's robustness but also its ability to discern complex auditory patterns associated with motion direction. The slightly lower accuracy observed on the W-axis (possibly due to reduced sound amplitude) suggests a path to further improving the model's sensitivity to auditory data. The comparison with existing models (SVM, RF, CNN, LSTM, and RNN) underscores the capability of deep learning techniques (especially CNN and RNN) to classify motion directions accurately and highlights the analytical capabilities of the proposed model in this area.

[0135] TABLE I: Classification accuracy of each axis in different models

[0137]

[0138] The fourth type of data mentioned in the experimental setup, the simulated attack data, was used to perform deception attacks on the robotic arm. The attack simulation was designed to reflect the three types of adversaries in the threat model.

Weak adversary: Lacking knowledge of the remote control protocol, the weak adversary cannot decode or modify remote control network packets and therefore cannot alter packets to achieve a desired effect.

Strong adversary: Possessing knowledge of the remote control protocol, the strong adversary can freely manipulate remote control network packets and modify control packets as needed to maliciously manipulate the robotic arm's movement. However, lacking expertise in the robotic arm's path planning algorithm, the strong adversary is limited to replaying previously recorded motion data. Such an adversary can bypass SCADA anomaly detection and perform destructive operations that may damage product quality and endanger on-site personnel, although careful workers might notice the robotic arm performing unfamiliar tasks.

Super adversary: Possessing knowledge of both the remote control protocol and the path planning algorithm, the super adversary can manipulate remote control network packets and maliciously control the robotic arm's movement. Such an adversary can construct motion information sequences based on the normal control codes issued by the operator and send them to SCADA, making the attack difficult for SCADA monitoring personnel to detect.

The attacks were simulated as follows. A weak adversary can still intercept communication data packets so that control commands never reach the robotic arm; this type of attack is simulated by replacing the sound data corresponding to the motion data with static ambient audio. To simulate a strong adversary replaying recorded motion data, a fixed set of motion data is paired with sounds that do not correspond to those movements. For a super adversary, who fully understands the robotic arm's trajectory planning, the replayed motion data will match the operator's original intent; therefore, only the sound segments are replaced.

[0139] As shown in Figure 5, the F1 score and detection latency are used to evaluate the solution of this application. The F1 score is the harmonic mean of precision (P) and recall (R): F1 = 2PR / (P + R). Detection latency evaluates efficiency and refers to the time from intrusion to detection; the smaller the detection latency, the more timely the intrusion detection. The detection latency T is the time $t_d$ at which the attack is detected minus the actual attack time $t_a$, that is, $T = t_d - t_a$.
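The two evaluation metrics can be computed as in the following sketch. It uses the standard definitions of precision and recall from true-positive, false-positive, and false-negative counts; the function names and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration.

```python
def f1_score(tp, fp, fn):
    """F1 = 2PR / (P + R), with P = tp / (tp + fp) and R = tp / (tp + fn)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def detection_latency(t_detect, t_attack):
    """T = t_d - t_a: time from the start of the attack to its detection."""
    return t_detect - t_attack
```

For instance, 8 true positives with 2 false positives and 2 false negatives give a precision and recall of 0.8 each, and therefore an F1 score of 0.8.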

[0140] Figure 5 (a) shows the F1 score of ASIDS in responding to weak adversaries. ASIDS achieved 100% F1 score across all six axes, demonstrating its strong ability to identify weak spoofing attacks. Furthermore, ASIDS exhibits a very fast attack detection response time, with an average detection latency of only 0.12s, indicating that the system can identify and respond to attacks almost instantly.

[0141] Figure 5 (b) presents ASIDS' performance against strong adversaries. In this scenario, ASIDS achieved an average F1 score of 97.7%, with most F1 scores above 95% across all six axes. ASIDS' response time to strong adversary attacks was slightly slower than against weaker adversaries because the replayed data might partially resemble the correct code, but it was still very fast, with an average detection latency of only 0.27s.

[0142] Figure 5 (c) presents the performance of ASIDS against super adversaries. In this scenario, ASIDS achieved an average F1 score of 92.6%, with most of the six axes scoring above 90%, demonstrating its strong ability to detect super spoofing attacks. The lowest detection rate in the test data of this application was 90.7%, possibly because the attacker's modified commands are very close to the correct commands and because the W-axis, being the furthest from the recording device, produces a lower sound amplitude and therefore poorer sound quality. Nevertheless, ASIDS detects attacks with high accuracy in the vast majority of cases. Its response time against super adversaries is slightly slower than against strong adversaries because the adversary may modify the code to make it very similar to the correct code, but it is still very fast, with an average detection latency of only 0.35 seconds.

[0143] Different signal-to-noise ratio (SNR) environments were constructed by independently capturing industrial noise played from loudspeakers, ambient noise in a quiet laboratory environment, and the sound of robotic arm movement. This setup allows for a detailed evaluation of the performance of ASIDS under various noise conditions, providing insights into the system's robustness and effectiveness in practical applications. The combined SNR was estimated by measuring the power of the robotic arm movement sound (as the signal component) and the separately played industrial noise (as the noise component). As shown in Figure 6, increased background noise lowers the SNR and is associated with a lower attack detection rate. When the signal-to-noise ratio approaches 8 dB, the detection rate exceeds 90%, indicating significant system effectiveness under these conditions.
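The SNR estimate from separately recorded signal and noise can be sketched as follows. This is a minimal sketch assuming the usual definition, ten times the base-10 logarithm of the ratio of mean signal power to mean noise power; the function names are hypothetical, and the text does not state the exact estimator that was used.

```python
import math


def mean_power(samples):
    """Average of the squared sample values."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(signal, noise):
    """SNR in decibels from separately recorded signal and noise segments."""
    return 10 * math.log10(mean_power(signal) / mean_power(noise))
```

Here `signal` would hold the recording of the robotic arm movement alone and `noise` the recording of the industrial noise played through the loudspeakers.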

[0144] Generally, the greater the distance between the microphone and the robotic arm, the weaker the captured sound amplitude, thus affecting the effectiveness of intrusion detection. In this experiment, in addition to the data recorded by the microphone fixed to the robotic arm, the performance of the external microphone at distances (distance between the microphone and the robotic arm base) of 20 cm, 60 cm, and 100 cm was also tested. As shown in Table 2 below, where P represents Precision and R represents Recall, the accuracy of intrusion detection decreases as the distance between the microphone and the robotic arm increases. Compared to a fixed microphone, the performance degradation at distances of 20 cm, 60 cm, and 100 cm is partly due to the decreased sound reception due to increased distance and partly due to the Doppler effect. The microphone attached to the robotic arm reduces the relative distance between the microphone and the robotic arm, thereby mitigating the impact of the Doppler effect. Therefore, when deploying ASIDS, it is recommended to place the microphone as close to the robotic arm as possible to minimize the impact of the Doppler phenomenon.

[0145] TABLE II: Classification Results With Distance Parameter

[0146]

[0147] The speed of the robotic arm's motion planning varies with different movement distances, resulting in changes in the sound produced during the robotic arm's movement. Evaluating the performance of ASIDS at different movement distances provides a comprehensive assessment of its ability to capture the relationship between motion and sound. For this purpose, movement distance data corresponding to 20%, 50%, 80%, and 100% of the motion range for each axis were selected. As shown in Table 3 below, where P represents Precision and R represents Recall, performance peaks near 80% of the movement distance, and the overall trend is that it improves with increasing movement distance. Multi-axis detection performance tends to be relatively lower. However, from an overall perspective, the acoustic motion information captured by ASIDS demonstrates robustness in recognizing the sound characteristics of motion planning at different distances.

[0148] TABLE III: Classification Results With Distance Parameter

[0149]

[0150] As described above, this application presents ASIDS (acoustic side-channel intrusion detection system), a method and framework for detecting spoofing attacks on industrial robotic arms based on acoustic motion information obtained using sound separation technology. It leverages the correlation between the acoustic side channel and the robotic arm's motion to protect the arm's safety. Specifically, it first utilizes the acoustic features of each axis of the robotic arm and separates the mixed audio using sound separation technology. Next, it combines kinematic knowledge and data-driven methods to perform acoustic motion modeling. The motion state sequence of the robotic arm reconstructed through the audio side channel is cross-validated against the motion sequence collected by the SCADA system in the network to determine whether the network data has been tampered with or subjected to a spoofing attack: when the cumulative error between the acoustically reconstructed motion information and the collected motion information exceeds a preset threshold, a motion anomaly, and thus a potential spoofing attack, is identified. This allows ASIDS to reconstruct and verify the robotic arm's motion pattern through acoustic signals, effectively detecting anomalies in motion. Tests were conducted using a Borunte 1820A robotic arm over more than 25,000 operation cycles. Experimental results show that ASIDS can accurately identify spoofing attacks within an average of 0.26 seconds, with an average F1 score of 94.3%. These experimental results demonstrate the effectiveness and practicality of ASIDS in improving the safety of robotic arms in manufacturing networks.

[0151] Embodiment 2

[0152] Based on the acoustic feature-based side-channel detection method for robotic arm deception attacks in Embodiment 1, a robotic arm deception attack side-channel detection system based on acoustic features can be constructed. A brief description is given here. For details on the basic scheme and examples of combinations of technical features of various improved schemes, please refer to Embodiment 1 above.

[0153] The side-channel detection system for robotic arm spoofing attacks based on acoustic features in Embodiment 2 includes the following:

[0154] Model building module: Used to build and train the recognition model, including the following processing steps:

[0155] Sound processing steps: Noise reduction and sound separation are performed on the collected audio, and then features are extracted from the individual motion audio of each axis and concatenated into time-domain feature vectors and frequency-domain feature vectors;

[0156] Motion processing steps: Process the motion information collected for each axis, calculate the velocity, acceleration, and direction of motion for each axis, and form a motion information sequence;

[0157] Training and recognition steps: First, use an LSTM network to process the time-domain feature vector input data and a convolutional layer network to process the frequency-domain feature vector input data, then perform feature fusion, and then feed the fused features into a classification network and a regression network. The classification network uses a fully connected layer and a softmax function to output direction classification information, and the regression network uses a fully connected layer and a ReLU activation function to output velocity and acceleration prediction information, thus establishing a mapping between the sound feature vectors and the motion information sequence.

[0158] Motion detection module: uses the trained recognition model to receive the collected target audio, performs the sound processing steps to extract features, and performs the training and recognition steps to output the predicted motion data corresponding to the target audio;

[0159] Attack detection module: It is used to compare the predicted motion data with the corresponding real-time motion data and evaluate whether the difference between each pair of parameters exceeds the preset threshold; if the difference is greater than the corresponding preset threshold, an intrusion alarm will be triggered.

[0160] The noise reduction and separation process in the sound processing steps of the model building module is as follows:

[0161] By statistically analyzing the noise generated by each axis in a quiet environment, the motion frequency range is identified, and a bandpass frequency filter is used to eliminate noise outside the motion frequency range.

[0162] By applying a short time window, the spectral characteristics of background noise during brief inactivity intervals within the operating cycle are captured using Fast Fourier Transform. This noise spectrum is then subtracted from the main audio signal, and the processed signal is converted back to the time domain using Inverse Fast Fourier Transform, thus completing the enhancement and noise reduction.

[0163] A sparse representation algorithm is used to separate individual motion audio for each axis, which moves independently; the processing flow includes the following:

[0164] (1) Configure the normalized amplitude spectrum of the mixed signal for simultaneous multi-axis operation; the amplitude spectrum is represented as follows:

[0165]

[0166] Where X is the amplitude spectrum of the sound signal from simultaneous multi-axis operation, which is approximated by the approximate amplitude spectrum of the mixed sound signal; B represents the matrix consisting of the dictionaries for each axis and is an overcomplete dictionary that includes all amplitude spectrum segments of the sound for each axis; W represents the coefficient matrix; A is the active set; $b_i$ is an atom of B; and $w_i$ is the corresponding coefficient;

[0167] (2) The extracted normalized amplitude spectrum is clustered using the k-means clustering algorithm to construct an overcomplete dictionary for each axis; the overcomplete dictionary is represented as follows:

[0168] $B = (B_x \; B_y \; B_z \; B_u \; B_v \; B_w)^T$

[0169] $B_x$ is the overcomplete dictionary of axis x, $B_y$ is the overcomplete dictionary of axis y, $B_z$ is the overcomplete dictionary of axis z, $B_u$ is the overcomplete dictionary of axis u, $B_v$ is the overcomplete dictionary of axis v, and $B_w$ is the overcomplete dictionary of axis w;

[0170] (3) Solve for the sparse coefficients using the active set Newton algorithm; the specific processing is as follows:

[0171] The sparse coefficient $w_n$ is initialized and then iteratively updated, where 1 is an all-ones vector of the same length as X; $w_A$ is the vector of representation coefficients of the atoms in the active set A; α is the step size; the update uses the search direction, the Hessian matrix with respect to $w_A$, and the gradient of the KL divergence with respect to the active atom weight vector; before the inverse of the Hessian matrix is calculated, the identity matrix multiplied by a small constant $\epsilon = 10^{-10}$ is added to it; and the active set A is initialized with $A = \arg\min_{n} \mathrm{KL}(X \,\|\, w_n b_n)$ and then updated, where $\mathrm{KL}(X \,\|\, w_n b_n)$ denotes the KL divergence between X and $w_n b_n$; through the above iterative algorithm, the active set A and the representation coefficients $w_A$ of the corresponding atoms in the active set A are obtained;

[0172] (4) Using the inverse Fourier transform, the approximate amplitude spectrum $X_i$ of each axis, and the original phase, the individual motion audio of each axis is reconstructed;

[0173] Here, the approximate amplitude spectrum $\hat{X}$ of the mixed sound signal is expressed as follows:

[0174] $\hat{X} = \sum_{i \in \{x,y,z,u,v,w\}} w_{iA} B_{iA}$

[0175] Therefore, the approximate amplitude spectrum of each axis is expressed by the following formula:

[0176] $X_i = w_{iA} B_{iA},\ i \in \{x, y, z, u, v, w\}$

[0177] Here, i is not only the index of the axis, but also the index of the row of the overcomplete dictionary B.
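As an illustration of steps (3) and (4), the following is a minimal Python sketch of how per-axis amplitude spectra could be recovered once the dictionary is available. It is not the active set Newton algorithm of this embodiment: for brevity it swaps in standard multiplicative updates, which also decrease the KL divergence. The names `solve_weights`, `split_by_axis`, and `atom_axis`, the column-wise atom layout, and the toy dimensions are assumptions made for the example.

```python
import numpy as np

def kl_divergence(x, x_hat, eps=1e-12):
    """Generalized KL divergence between two non-negative vectors."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    return float(np.sum(x * np.log((x + eps) / (x_hat + eps)) - x + x_hat))

def solve_weights(x, B, n_iter=200, eps=1e-12):
    """Estimate non-negative weights w such that x is approximately B @ w.

    Multiplicative KL updates are used here as a simple stand-in for the
    active set Newton algorithm (assumption for illustration only).
    """
    # Equal initial weights, scaled so that B @ w has the same total as x.
    w = np.full(B.shape[1], x.sum() / (B.sum() + eps))
    col_sums = B.sum(axis=0) + eps
    for _ in range(n_iter):
        x_hat = B @ w + eps
        w *= (B.T @ (x / x_hat)) / col_sums
    return w

def split_by_axis(B, w, atom_axis, axes):
    """Approximate spectrum of each axis: only the atoms owned by that axis."""
    return {a: B[:, atom_axis == a] @ w[atom_axis == a] for a in axes}

# Toy example: 16 frequency bins, two axes with three atoms each (hypothetical sizes).
rng = np.random.default_rng(0)
axes = ["x", "y"]
B = rng.random((16, 6)) + 0.01
atom_axis = np.array(["x"] * 3 + ["y"] * 3)
w_true = np.array([1.0, 0.0, 0.5, 0.0, 2.0, 0.0])
X = B @ w_true                      # mixed amplitude spectrum

w = solve_weights(X, B)
parts = split_by_axis(B, w, atom_axis, axes)
X_hat = parts["x"] + parts["y"]     # approximate mixed spectrum
```

Each per-axis amplitude spectrum would then be combined with the phase of the mixed signal and passed through the inverse Fourier transform, as described in step (4).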

[0178] The feature extraction process for the sound processing steps in the model building module is as follows:

[0179] In the time domain, the extracted features include frame energy, zero-crossing rate, temporal envelope, and energy entropy; in the frequency domain, spectral entropy and power-normalized cepstral coefficients are extracted.

[0180] For a given frame of length FL and an audio signal x(i), i = 1, 2, ..., FL,

[0181] Frame energy is obtained by the following formula:

[0182]

[0183] The zero-crossing rate is obtained as follows:

[0184]
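A minimal Python sketch of the two time-domain features, assuming the common definitions: mean squared amplitude for the frame energy, and half the summed absolute sign differences divided by FL for the zero-crossing rate. The exact normalisation used by this embodiment may differ, and the function names are illustrative.

```python
import numpy as np

def frame_energy(frame):
    """Sum of squared samples divided by the frame length FL."""
    frame = np.asarray(frame, dtype=float)
    return float(np.sum(frame ** 2) / len(frame))

def zero_crossing_rate(frame):
    """Sum of |sgn x(i) - sgn x(i-1)| divided by 2 * FL."""
    signs = np.sign(np.asarray(frame, dtype=float))
    return float(np.sum(np.abs(np.diff(signs))) / (2 * len(frame)))

alternating = [1.0, -1.0, 1.0, -1.0]   # changes sign at every sample
rising = [1.0, 2.0, 3.0, 4.0]          # never changes sign
```

A frame that alternates in sign at every sample gives the highest rate, while a frame that never changes sign gives zero.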

[0185] The signal is divided into frames, and the energy ratio of each component within each frame is calculated. These energy ratios are then used to calculate the energy entropy of the sound. The frame is divided into short sub-frames of length K; letting $E_j$ be the energy of the j-th short sub-frame, the energy entropy is:

[0186]

[0187] The temporal envelope of the sound is obtained in the following way:

[0188] $T_e(i) = \sqrt{\mathrm{Re}(HT(i))^2 + \mathrm{Im}(HT(i))^2}$

[0189] Where HT(i) is the analytic signal obtained by applying the Hilbert transform to the signal x(i), and the envelope $T_e(i)$ of the analytic signal is the square root of the sum of squares of its real part Re(HT(i)) and its imaginary part Im(HT(i));
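The envelope definition above can be sketched in Python as follows. The analytic signal is built with the usual FFT construction (negative frequencies zeroed, positive ones doubled), which is one common way to apply the Hilbert transform and is assumed here; `analytic_signal` and `temporal_envelope` are illustrative names.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal HT(i) of a real sequence, built in the frequency domain."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    spectrum = np.fft.fft(x)
    gain = np.zeros(n)
    gain[0] = 1.0                      # keep the DC component once
    if n % 2 == 0:
        gain[n // 2] = 1.0             # keep the Nyquist component once
        gain[1:n // 2] = 2.0           # double the positive frequencies
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * gain)

def temporal_envelope(x):
    """Square root of the sum of squares of the real and imaginary parts."""
    ht = analytic_signal(x)
    return np.sqrt(ht.real ** 2 + ht.imag ** 2)

# A cosine of amplitude 3 with a whole number of cycles has a flat envelope of 3.
t = np.arange(200) / 200.0
tone = 3.0 * np.cos(2 * np.pi * 5 * t)
envelope = temporal_envelope(tone)
```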

[0190] $X_i(k)$, k = 1, ..., FL, represents the magnitude of the Fast Fourier Transform coefficients of a given frame;

[0191] The spectrum is divided into N subbands; letting $E_f$ be the energy of the f-th subband, the spectral entropy is:

[0192]
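A hedged Python sketch of the two entropy features. It assumes the usual form, the negative sum of each energy ratio times its base-2 logarithm, with ratios normalised to sum to one; the logarithm base, the handling of leftover samples, and the names `energy_entropy` and `spectral_entropy` are assumptions rather than details taken from this embodiment.

```python
import numpy as np

def _entropy_of_ratios(energies, eps=1e-12):
    """Shannon entropy (base 2) of energies normalised to ratios."""
    energies = np.asarray(energies, dtype=float)
    ratios = energies / (energies.sum() + eps)
    return float(-np.sum(ratios * np.log2(ratios + eps)))

def energy_entropy(frame, sub_len):
    """Entropy of the energy ratios of sub-frames of length sub_len."""
    frame = np.asarray(frame, dtype=float)
    usable = len(frame) - len(frame) % sub_len   # drop leftover samples
    sub = frame[:usable].reshape(-1, sub_len)
    return _entropy_of_ratios(np.sum(sub ** 2, axis=1))

def spectral_entropy(frame, n_subbands):
    """Entropy of the energy ratios of n_subbands bands of the magnitude spectrum."""
    mag = np.abs(np.fft.rfft(np.asarray(frame, dtype=float)))
    usable = len(mag) - len(mag) % n_subbands    # drop leftover bins
    bands = mag[:usable].reshape(n_subbands, -1)
    return _entropy_of_ratios(np.sum(bands ** 2, axis=1))

rng = np.random.default_rng(0)
n = 64
pure_tone = np.sin(2 * np.pi * 8 * np.arange(n) / n)   # energy in a single band
white_noise = rng.standard_normal(n)                   # energy spread over bands
```

Energy spread evenly over the sub-frames or subbands gives a high entropy, whereas energy concentrated in one of them gives an entropy near zero.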

[0193] Based on this, the time-domain features and frequency-domain features are concatenated to form a time-domain feature vector $S_t$: {frame energy, zero-crossing rate, energy entropy, temporal envelope} and a frequency-domain feature vector $S_f$: {spectral entropy, power-normalized cepstral coefficients}.

[0194] The model building module also includes a trace alignment step: cross-correlation analysis is used to align the audio data with the motion data in terms of time traces. Specifically:

[0195] First, cross-correlation analysis is used to analyze the relationship between each group of audio and motion data samples in the training dataset, obtaining the locally optimal time offset for each group of samples. Then, the minimum and maximum time offsets for all samples are determined. Next, using the minimum and maximum time offsets as the boundaries of the search interval, a grid search method is used within this search interval to find the globally optimal time offset τ for the entire training dataset. τ is then used as the fixed offset for trace alignment. The cross-correlation coefficient and the optimal time offset are calculated as follows:

[0196] $R(\tau) = \sum_t x(t)\, m(t+\tau)$

[0197] $\tau_{optimal} = \arg\max_{\tau} R(\tau)$

[0198] R(τ) represents the cross-correlation coefficient between the sound time series x and the motion time series m at time offset τ; x(t) and m(t+τ) are the values of the two series at times t and (t+τ), respectively; by varying τ, the cross-correlation coefficient at different time offsets can be calculated, and the value $\tau_{optimal}$ that maximizes R(τ) represents the optimal time offset between the two sequences.
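The trace alignment procedure can be sketched in Python as follows. The sketch assumes an unnormalised cross-correlation over the overlapping samples and, for the grid search, scores each candidate offset by the cross-correlation summed over all sample pairs; the text does not specify that scoring rule, so it is an assumption, as are the function names and the `max_lag` parameter.

```python
import numpy as np

def cross_correlation(x, m, tau):
    """R(tau): sum over t of x(t) * m(t + tau), over the overlapping samples."""
    x = np.asarray(x, dtype=float)
    m = np.asarray(m, dtype=float)
    if tau >= 0:
        n = min(len(x), len(m) - tau)
        return float(np.dot(x[:n], m[tau:tau + n])) if n > 0 else 0.0
    n = min(len(x) + tau, len(m))
    return float(np.dot(x[-tau:-tau + n], m[:n])) if n > 0 else 0.0

def best_offset(x, m, candidates):
    """Locally optimal offset of one audio/motion pair."""
    return max(candidates, key=lambda tau: cross_correlation(x, m, tau))

def global_offset(pairs, max_lag):
    """Grid search between the smallest and largest local offsets."""
    local = [best_offset(x, m, range(-max_lag, max_lag + 1)) for x, m in pairs]

    def total(tau):
        return sum(cross_correlation(x, m, tau) for x, m in pairs)

    grid = range(min(local), max(local) + 1)
    return max(grid, key=total), local

def delayed(x, d):
    """Copy of x delayed by d samples (zero-padded at the start)."""
    return np.concatenate([np.zeros(d), x])[:len(x)]

# Synthetic pairs in which the motion trace lags the audio by 3, 3 and 4 samples.
rng = np.random.default_rng(1)
pairs = []
for d in (3, 3, 4):
    audio = rng.standard_normal(300)
    pairs.append((audio, delayed(audio, d)))

tau_optimal, local_offsets = global_offset(pairs, max_lag=10)
```

The returned `tau_optimal` would then be applied as the fixed offset when pairing audio features with motion labels.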

[0199] As mentioned above, sound separation technology is used to separate mixed multi-axis sounds into different individual axis sounds. An acoustic motion information recognition model for each motion joint is constructed through a collaborative physics-based and data-driven approach. Based on this recognition model, the target sound is identified, and the predicted motion information for each axis is obtained. By comparing this predicted motion information with the collected real-time motion commands and combining it with a threshold, it is determined whether a spoofing attack has occurred, thus realizing the side-channel detection of spoofing attacks on the robotic arm.

[0200] Example 3

[0201] The aforementioned Embodiments 1 and 2 include a robotic arm motion detection method based on acoustic features, which will be briefly described here. For details on the basic scheme and examples of combinations of technical features of various improved schemes, please refer to the aforementioned Embodiment 1.

[0202] The robotic arm motion detection method based on acoustic features in this embodiment 3 includes the following:

[0203] Step S1: Construct a training recognition model, including the following processing flow:

[0204] Sound processing steps: Noise reduction and sound separation are performed on the collected audio, and then features are extracted from the individual motion audio of each axis and concatenated into time-domain feature vectors and frequency-domain feature vectors;

[0205] Motion processing steps: Process the motion information collected for each axis, calculate the velocity, acceleration, and direction of motion for each axis, and form a motion information sequence;

[0206] Training and recognition steps: First, use an LSTM network to process the temporal feature vector input data, then use a convolutional layer network to process the frequency domain feature vector input data, then perform feature fusion, and then feed it into a classification network and a regression network. The classification network uses a fully connected layer and a softmax function to output orientation classification information, and the regression network uses a fully connected layer and a ReLU activation function to output velocity and acceleration prediction information, thus establishing a mapping between the sound feature vector and the motion information sequence.
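The output stage of the training and recognition step can be illustrated with a small NumPy sketch. The LSTM and convolutional branches are not implemented: `h_time` and `h_freq` are random placeholders for their outputs, fusion is assumed to be simple concatenation, and all dimensions and parameter names are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def relu(z):
    return np.maximum(z, 0.0)

def predict_heads(h_time, h_freq, params):
    """Fuse both branches, then apply the classification and regression heads."""
    fused = np.concatenate([h_time, h_freq])                               # feature fusion
    direction_probs = softmax(params["W_cls"] @ fused + params["b_cls"])  # direction classes
    vel_acc = relu(params["W_reg"] @ fused + params["b_reg"])             # velocity, acceleration
    return direction_probs, vel_acc

rng = np.random.default_rng(0)
h_time = rng.standard_normal(8)   # placeholder for the LSTM branch output
h_freq = rng.standard_normal(6)   # placeholder for the convolutional branch output
params = {
    "W_cls": rng.standard_normal((2, 14)), "b_cls": np.zeros(2),
    "W_reg": rng.standard_normal((2, 14)), "b_reg": np.zeros(2),
}
direction_probs, vel_acc = predict_heads(h_time, h_freq, params)
```

Because the regression head ends in a ReLU, its outputs are non-negative magnitudes, which is consistent with the direction being reported separately by the classification head.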

[0207] Step S2: The training recognition model receives the collected target audio, performs sound processing operations to extract features, and performs training recognition operations to output the predicted motion data corresponding to the target audio.

[0208] The noise reduction and separation process in step S1, the audio processing step, is as follows:

[0209] By statistically analyzing the noise generated by each axis in a quiet environment, the motion frequency range is identified, and a bandpass frequency filter is used to eliminate noise outside the motion frequency range.
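A minimal sketch of the band-pass stage, assuming an idealised filter that zeroes FFT bins outside the motion frequency range. A practical system would more likely use a designed IIR or FIR filter; the function name, sample rate, and band limits below are illustrative assumptions.

```python
import numpy as np

def bandpass_fft(signal, sample_rate, low_hz, high_hz):
    """Keep only spectral components between low_hz and high_hz."""
    signal = np.asarray(signal, dtype=float)
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    spectrum[(freqs < low_hz) | (freqs > high_hz)] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))

sample_rate = 1000
t = np.arange(sample_rate) / sample_rate
in_band = np.sin(2 * np.pi * 100 * t)      # inside the assumed motion range
out_of_band = np.sin(2 * np.pi * 400 * t)  # outside the assumed motion range
filtered = bandpass_fft(in_band + out_of_band, sample_rate, 50, 200)
```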

[0210] A sparse representation algorithm is used to separate individual motion audio for each axis, which moves independently; the processing flow includes the following:

[0211] (1) Configure the normalized amplitude spectrum of the mixed signal for simultaneous multi-axis operation; the amplitude spectrum is represented as follows:

[0212] $X \approx \hat{X} = BW = \sum_{i \in A} w_i b_i$

[0213] Where X is the amplitude spectrum of the sound signal from simultaneous multi-axis operation; $\hat{X}$ is the approximate amplitude spectrum of the mixed sound signal; B is the matrix consisting of the dictionaries of each axis, an overcomplete dictionary that includes all amplitude spectrum segments of the sound of each axis; W is the coefficient matrix; A is the active set; $b_i$ is an atom of B; and $w_i$ is the corresponding coefficient;

[0214] (2) The extracted normalized amplitude spectrum is clustered using the k-means clustering algorithm to construct an overcomplete dictionary for each axis; the overcomplete dictionary is represented as follows:

[0215] $B = (B_x\ B_y\ B_z\ B_u\ B_v\ B_w)^T$

[0216] Where $B_x$ is the overcomplete dictionary of axis x, $B_y$ is the overcomplete dictionary of axis y, $B_z$ is the overcomplete dictionary of axis z, $B_u$ is the overcomplete dictionary of axis u, $B_v$ is the overcomplete dictionary of axis v, and $B_w$ is the overcomplete dictionary of axis w;

[0217] (3) Solve for the sparse coefficients using the active set Newton algorithm; the specific processing is as follows:

[0218] The sparse coefficient $w_n$ of each atom is initialized by $w_n = \frac{\mathbf{1}^T X}{\mathbf{1}^T b_n}$ and updated by $w_A \leftarrow w_A - \alpha H_{w_A}^{-1} \nabla_{w_A}$; where $\mathbf{1}$ is an all-ones vector of the same length as X; $w_A$ is the vector of representation coefficients of the atoms in the active set A; $\alpha$ is the step size; $H_{w_A}^{-1} \nabla_{w_A}$ is the search direction; $H_{w_A}$ is the Hessian matrix with respect to $w_A$; $\nabla_{w_A}$ is the gradient of the KL divergence with respect to the active atom weight vector; before the inverse of the Hessian matrix is calculated, the identity matrix multiplied by a small constant $\epsilon = 10^{-10}$ is added to it; and the active set A is initialized with $A = \arg\min_n KL(X \| w_n b_n)$ and then updated at each iteration, where $KL(X \| w_n b_n)$ represents the KL divergence between X and $w_n b_n$; through the above iterative algorithm, the active set A and the representation coefficients $w_A$ of the corresponding atoms in the active set A are obtained;

[0219] (4) Using the inverse Fourier transform, the approximate amplitude spectrum $X_i$ of each axis, and the original phase, the individual motion audio of each axis is reconstructed;

[0220] Here, the approximate amplitude spectrum $\hat{X}$ of the mixed sound signal is expressed as follows:

[0221] $\hat{X} = \sum_{i \in \{x,y,z,u,v,w\}} w_{iA} B_{iA}$

[0222] Therefore, the approximate amplitude spectrum of each axis is expressed by the following formula:

[0223] $X_i = w_{iA} B_{iA},\ i \in \{x, y, z, u, v, w\}$

[0224] Here, i is not only the index of the axis, but also the index of the row of the overcomplete dictionary B.
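The dictionary construction of step (2) can be sketched in Python as follows. The sketch assumes that each axis contributes a set of amplitude-spectrum frames recorded while it moves alone, that each frame is normalised to unit sum, and that the k-means centres serve directly as atoms; the plain Lloyd iteration, the number of atoms per axis, and the names `kmeans` and `build_dictionary` are illustrative choices, not details of this embodiment.

```python
import numpy as np

def kmeans(data, k, n_iter=50, seed=0):
    """Plain Lloyd k-means; rows of data are samples, returns k centres."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:       # keep the old centre if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers

def build_dictionary(spectra_by_axis, atoms_per_axis):
    """Stack the per-axis dictionaries row-wise and record each atom's axis."""
    blocks, owners = [], []
    for axis, frames in spectra_by_axis.items():
        frames = np.asarray(frames, dtype=float)
        frames = frames / (frames.sum(axis=1, keepdims=True) + 1e-12)  # normalise frames
        blocks.append(kmeans(frames, atoms_per_axis))
        owners += [axis] * atoms_per_axis
    return np.vstack(blocks), np.array(owners)

# Toy data: 40 frames of 16 frequency bins for each of two axes (hypothetical sizes).
rng = np.random.default_rng(0)
spectra_by_axis = {"x": rng.random((40, 16)), "y": rng.random((40, 16))}
B, owners = build_dictionary(spectra_by_axis, atoms_per_axis=5)
```

Here each row of `B` is one atom and `owners` records the axis it belongs to, which mirrors the statement that i indexes both the axis and the rows of B.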

[0225] The feature extraction process for step S1, the sound processing step, is as follows:

[0226] In the time domain, the extracted features include frame energy, zero-crossing rate, temporal envelope, and energy entropy; in the frequency domain, spectral entropy and power-normalized cepstral coefficients are extracted.

[0227] For a given frame of length FL and an audio signal x(i), i = 1, 2, ..., FL,

[0228] Frame energy is obtained by the following formula:

[0229]

[0230] The zero-crossing rate is obtained as follows:

[0231]

[0232] The signal is divided into frames, and the energy ratio of each component within each frame is calculated. These energy ratios are then used to calculate the energy entropy of the sound. The frame is divided into short sub-frames of length K; letting $E_j$ be the energy of the j-th short sub-frame, the energy entropy is:

[0233]

[0234] The temporal envelope of the sound is obtained in the following way:

[0235] $T_e(i) = \sqrt{\mathrm{Re}(HT(i))^2 + \mathrm{Im}(HT(i))^2}$

[0236] Where HT(i) is the analytic signal obtained by applying the Hilbert transform to the signal x(i), and the envelope $T_e(i)$ of the analytic signal is the square root of the sum of squares of its real part Re(HT(i)) and its imaginary part Im(HT(i));

[0237] $X_i(k)$, k = 1, ..., FL, represents the magnitude of the Fast Fourier Transform coefficients of a given frame;

[0238] The spectrum is divided into N subbands; letting $E_f$ be the energy of the f-th subband, the spectral entropy is:

[0239]

[0240] Based on this, the time-domain features and frequency-domain features are concatenated to form a time-domain feature vector $S_t$: {frame energy, zero-crossing rate, energy entropy, temporal envelope} and a frequency-domain feature vector $S_f$: {spectral entropy, power-normalized cepstral coefficients}.

[0241] In step S3, while the predicted motion data is obtained, the corresponding real-time motion data of the target is acquired from the SCADA system and processed to obtain the motion index parameters within the network. The difference between each pair of parameters is then evaluated according to the EWMA control chart. If the difference is greater than the corresponding preset threshold, an intrusion alarm is triggered.
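The EWMA evaluation in step S3 can be illustrated with a short Python sketch. It uses the textbook EWMA control chart with time-varying limits; the smoothing factor, the limit width, and the assumed mean and standard deviation of the attack-free difference are illustrative values, since the text does not state them.

```python
import math

def ewma_alarms(differences, lam=0.2, width_factor=3.0, mu=0.0, sigma=1.0):
    """Flag each sample whose EWMA statistic leaves the control limits.

    differences: predicted minus reported value of one motion parameter.
    lam, width_factor, mu, sigma: assumed chart parameters (hypothetical).
    """
    z = mu
    alarms = []
    for t, d in enumerate(differences, start=1):
        z = lam * d + (1.0 - lam) * z                       # EWMA statistic
        spread = lam / (2.0 - lam) * (1.0 - (1.0 - lam) ** (2 * t))
        limit = width_factor * sigma * math.sqrt(spread)    # control limit at step t
        alarms.append(abs(z - mu) > limit)
    return alarms

# Small differences during normal operation, then a sustained deviation.
differences = [0.1, -0.2, 0.0, 0.1, 6.0, 6.0, 6.0]
alarms = ewma_alarms(differences)
```

Because the statistic is a weighted running average, a single noisy sample moves it only slightly, while a sustained mismatch between the acoustic estimate and the reported motion pushes it past the limit.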

[0242] To enhance noise reduction, the noise reduction and separation process of step S1, the sound processing step, in the robotic arm deception attack side channel detection method includes the following enhanced noise reduction step: by applying a short time window, the spectral characteristics of the background noise during the brief inactivity interval within the operation period are captured using fast Fourier transform, the noise spectrum is subtracted from the main audio signal, and then the processed signal is converted back to the time domain using inverse fast Fourier transform to complete the enhanced noise reduction.
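The enhanced noise reduction step can be sketched as a single-frame spectral subtraction in Python. The sketch assumes that the noise estimate is the magnitude spectrum of one inactive window, that negative magnitudes are floored at zero, and that the phase of the noisy frame is reused; these choices, the frame length, and the function name are assumptions for illustration.

```python
import numpy as np

def spectral_subtract(frame, noise_frame):
    """Subtract the noise magnitude spectrum from a frame, keeping the frame's phase."""
    frame = np.asarray(frame, dtype=float)
    spec = np.fft.rfft(frame)
    noise_mag = np.abs(np.fft.rfft(np.asarray(noise_frame, dtype=float), n=len(frame)))
    clean_mag = np.maximum(np.abs(spec) - noise_mag, 0.0)   # floor at zero
    return np.fft.irfft(clean_mag * np.exp(1j * np.angle(spec)), n=len(frame))

n = 256
t = np.arange(n)
motion = np.sin(2 * np.pi * 10 * t / n)            # sound of the moving axis
background = 0.3 * np.sin(2 * np.pi * 50 * t / n)  # noise captured while the arm is idle
cleaned = spectral_subtract(motion + background, background)
```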

[0243] To address the differences in processing and transmission speeds between the main channel and side channels and improve the accuracy of side channel intrusion detection, step S1 of this robotic arm spoofing attack side channel detection method further includes the following trace alignment step: using cross-correlation analysis to align the audio data with the motion data in terms of time traces. Specifically:

[0244] First, cross-correlation analysis is used to analyze the relationship between each group of audio and motion data samples in the training dataset, obtaining the locally optimal time offset for each group of samples. Then, the minimum and maximum time offsets for all samples are determined. Next, using the minimum and maximum time offsets as the boundaries of the search interval, a grid search method is used within this search interval to find the globally optimal time offset τ for the entire training dataset. τ is then used as the fixed offset for trace alignment. The cross-correlation coefficient and the optimal time offset are calculated as follows:

[0245] $R(\tau) = \sum_t x(t)\, m(t+\tau)$

[0246] $\tau_{optimal} = \arg\max_{\tau} R(\tau)$

[0247] R(τ) represents the cross-correlation coefficient between the sound time series x and the motion time series m at time offset τ; x(t) and m(t+τ) are the values of the two series at times t and (t+τ), respectively; by varying τ, the cross-correlation coefficient at different time offsets can be calculated, and the value $\tau_{optimal}$ that maximizes R(τ) represents the optimal time offset between the two sequences.

[0248] As mentioned above, sound separation technology is used to separate mixed multi-axis sounds into different individual axis sounds. An acoustic motion information recognition model for each motion joint is constructed through a collaborative physics-based and data-driven approach. Based on this recognition model, the target sound is identified, and the predicted motion information for each axis is obtained. By comparing this predicted motion information with the collected real-time motion commands and combining it with a threshold, it is determined whether a spoofing attack has occurred, thus realizing the side-channel detection of spoofing attacks on the robotic arm.

[0249] It should be noted that the examples of the above embodiments can preferably be combined with one or more of each other according to actual needs, and the accompanying drawings of multiple examples adopt a set of combined technical features, which will not be described in detail here.

[0250] The above description is a detailed explanation and illustration of the preferred embodiments of the present invention. However, these descriptions are not intended to limit the scope of protection claimed by the present invention. All equivalent changes or modifications made under the technical teachings of the present invention should fall within the patent protection scope covered by the present invention.

Claims

1. A side-channel detection method for robotic arm spoofing attacks based on acoustic features, characterized in that, Includes the following: Step S1: Construct a training recognition model, including the following processing flow: Sound processing steps: Noise reduction and sound separation are performed on the collected audio, and then features are extracted from the individual motion audio of each axis and concatenated into time-domain feature vectors and frequency-domain feature vectors; Motion processing steps: Process the motion information collected for each axis, calculate the velocity, acceleration, and direction of motion for each axis, and form a motion information sequence; Training and recognition steps: First, use an LSTM network to process the temporal feature vector input data, then use a convolutional layer network to process the frequency domain feature vector input data, then perform feature fusion, and then feed it into a classification network and a regression network. The classification network uses a fully connected layer and a softmax function to output orientation classification information, and the regression network uses a fully connected layer and a ReLU activation function to output velocity and acceleration prediction information, thus establishing a mapping between the sound feature vector and the motion information sequence. Step S2: The training recognition model receives the collected target audio, performs sound processing operations to extract features, and performs training recognition operations to output the predicted motion data corresponding to the target audio. Step S3: Compare the predicted motion data with the corresponding real-time motion data obtained, and evaluate whether the difference between each pair of parameters exceeds the preset threshold. If the difference exceeds the corresponding preset threshold, an intrusion alarm will be triggered.

2. The side-channel detection method for robotic arm spoofing attacks based on acoustic features according to claim 1, characterized in that: The noise reduction and separation process in the sound processing step of step S1 is as follows: By statistically analyzing the noise generated by each axis in a quiet environment, the motion frequency range is identified, and a bandpass frequency filter is used to eliminate noise outside the motion frequency range. A sparse representation algorithm is used to separate the individual motion audio of each independently moving axis; the processing flow includes the following: (1) Configure the normalized amplitude spectrum of the mixed signal for simultaneous multi-axis operation; the amplitude spectrum is represented as follows: $X \approx \hat{X} = BW = \sum_{i \in A} w_i b_i$, where X is the amplitude spectrum of the sound signal from simultaneous multi-axis operation; $\hat{X}$ is the approximate amplitude spectrum of the mixed sound signal; B is the matrix consisting of the dictionaries of each axis, an overcomplete dictionary that includes all amplitude spectrum segments of the sound of each axis; W is the coefficient matrix; A is the active set; $b_i$ is an atom of B; and $w_i$ is the corresponding coefficient; (2) The extracted normalized amplitude spectrum is clustered using the k-means clustering algorithm to construct an overcomplete dictionary for each axis; the overcomplete dictionary is represented as follows: $B = (B_x\ B_y\ B_z\ B_u\ B_v\ B_w)^T$, where $B_x$ is the overcomplete dictionary of axis x, $B_y$ is the overcomplete dictionary of axis y, $B_z$ is the overcomplete dictionary of axis z, $B_u$ is the overcomplete dictionary of axis u, $B_v$ is the overcomplete dictionary of axis v, and $B_w$ is the overcomplete dictionary of axis w; (3) Solve for the sparse coefficients using the active set Newton algorithm; the specific processing is as follows: the sparse coefficient $w_n$ of each atom is initialized by $w_n = \frac{\mathbf{1}^T X}{\mathbf{1}^T b_n}$ and updated by $w_A \leftarrow w_A - \alpha H_{w_A}^{-1} \nabla_{w_A}$; where $\mathbf{1}$ is an all-ones vector of the same length as X; $w_A$ is the vector of representation coefficients of the atoms in the active set A; $\alpha$ is the step size; $H_{w_A}^{-1} \nabla_{w_A}$ is the search direction; $H_{w_A}$ is the Hessian matrix with respect to $w_A$; $\nabla_{w_A}$ is the gradient of the KL divergence with respect to the active atom weight vector; before the inverse of the Hessian matrix is calculated, the identity matrix multiplied by a small constant $\epsilon = 10^{-10}$ is added to it; and the active set A is initialized with $A = \arg\min_n KL(X \| w_n b_n)$ and then updated at each iteration, where $KL(X \| w_n b_n)$ represents the KL divergence between X and $w_n b_n$; through the above iterative algorithm, the active set A and the representation coefficients $w_A$ of the corresponding atoms in the active set A are obtained; (4) Using the inverse Fourier transform, the approximate amplitude spectrum $X_i$ of each axis, and the original phase, the individual motion audio of each axis is reconstructed; the approximate amplitude spectrum $\hat{X}$ of the mixed sound signal is expressed as follows: $\hat{X} = \sum_{i \in \{x,y,z,u,v,w\}} w_{iA} B_{iA}$; therefore, the approximate amplitude spectrum of each axis is expressed by the following formula: $X_i = w_{iA} B_{iA},\ i \in \{x, y, z, u, v, w\}$, where i is not only the index of the axis, but also the index of the row of the overcomplete dictionary B.

3. The side-channel detection method for robotic arm spoofing attacks based on acoustic features according to claim 2, characterized in that: The noise reduction and separation process in step S1 of the sound processing steps also includes the following: By applying a short time window, the spectral characteristics of the background noise during the brief inactivity interval within the operation period are captured using Fast Fourier Transform. The noise spectrum is subtracted from the main audio signal, and then the processed signal is converted back to the time domain using Inverse Fast Fourier Transform to complete the enhancement and noise reduction.

4. The side-channel detection method for robotic arm spoofing attacks based on acoustic features according to claim 1, characterized in that: The feature extraction process for step S1, the sound processing step, is as follows: In the time domain, the extracted features include frame energy, zero-crossing rate, temporal envelope, and energy entropy; in the frequency domain, spectral entropy and power-normalized cepstral coefficients are extracted. For a given frame of length FL and an audio signal x(i), i = 1, 2, ..., FL, Frame energy is obtained by the following formula: The zero-crossing rate is obtained as follows: ; The signal is divided into frames, and the energy ratio of each component within each frame is calculated. These energy ratios are then used to calculate the energy entropy of the sound. The frame is divided into short sub-frames of length K; letting $E_j$ be the energy of the j-th short sub-frame, the energy entropy is: The temporal envelope of the sound is obtained in the following way: $T_e(i) = \sqrt{\mathrm{Re}(HT(i))^2 + \mathrm{Im}(HT(i))^2}$, where HT(i) is the analytic signal obtained by applying the Hilbert transform to the signal x(i), and the envelope $T_e(i)$ of the analytic signal is the square root of the sum of squares of its real part Re(HT(i)) and its imaginary part Im(HT(i)); $X_i(k)$, k = 1, ..., FL, represents the magnitude of the Fast Fourier Transform coefficients of a given frame; The spectrum is divided into N subbands; letting $E_f$ be the energy of the f-th subband, the spectral entropy is: Based on this, the time-domain features and frequency-domain features are concatenated to form a time-domain feature vector $S_t$: {frame energy, zero-crossing rate, energy entropy, temporal envelope} and a frequency-domain feature vector $S_f$: {spectral entropy, power-normalized cepstral coefficients}.

5. The side-channel detection method for robotic arm spoofing attacks based on acoustic features according to claim 1, characterized in that, Step S1 also includes a trace alignment step: cross-correlation analysis is used to align the audio data with the motion data in terms of time traces; specifically as follows: First, cross-correlation analysis is used to analyze the relationship between each group of audio and motion data samples in the training dataset to obtain the locally optimal time offset for each group of samples. Then, the minimum and maximum time offsets for all samples are determined. Next, using the minimum and maximum time offsets as the boundaries of the search interval, a grid search method is used within this search interval to find the globally optimal time offset τ for the entire training dataset, and τ is used as the fixed offset for trace alignment; the cross-correlation coefficient and the optimal time offset are calculated as follows: $R(\tau) = \sum_t x(t)\, m(t+\tau)$ and $\tau_{optimal} = \arg\max_{\tau} R(\tau)$, where R(τ) represents the cross-correlation coefficient between the sound time series x and the motion time series m at time offset τ; x(t) and m(t+τ) are the values of the two series at times t and (t+τ), respectively; by varying τ, the cross-correlation coefficient at different time offsets can be calculated, and the value $\tau_{optimal}$ that maximizes R(τ) represents the optimal time offset between the two sequences.

6. A side-channel detection system for robotic arm spoofing attacks based on acoustic features, characterized in that, Includes the following: Model building module: Used to build and train the recognition model, including the following processing steps: Sound processing steps: Noise reduction and sound separation are performed on the collected audio, and then features are extracted from the individual motion audio of each axis and concatenated into time-domain feature vectors and frequency-domain feature vectors; Motion processing steps: Process the motion information collected for each axis, calculate the velocity, acceleration, and direction of motion for each axis, and form a motion information sequence; Training and recognition steps: First, use an LSTM network to process the temporal feature vector input data, then use a convolutional layer network to process the frequency domain feature vector input data, then perform feature fusion, and then feed it into a classification network and a regression network. The classification network uses a fully connected layer and a softmax function to output orientation classification information, and the regression network uses a fully connected layer and a ReLU activation function to output velocity and acceleration prediction information, thus establishing a mapping between the sound feature vector and the motion information sequence. Motion detection module: used to train the recognition model to receive the collected target audio, perform sound processing steps to extract features, perform training and recognition steps to output the predicted motion data corresponding to the target audio; Attack detection module: used to compare the predicted motion data with the corresponding real-time motion data obtained, and to evaluate whether the difference between each pair of parameters exceeds a preset threshold. If the difference exceeds the corresponding preset threshold, an intrusion alarm will be triggered.

7. The side-channel detection system for robotic arm spoofing attacks based on acoustic features according to claim 6, characterized in that, The noise reduction and separation process in the sound processing steps of the model building module is as follows: By statistically analyzing the noise generated by each axis in a quiet environment, the motion frequency range is identified, and a bandpass frequency filter is used to eliminate noise outside the motion frequency range. By applying a short time window, the spectral characteristics of background noise during the brief inactivity interval within the operation cycle are captured using Fast Fourier Transform. This noise spectrum is subtracted from the main audio signal, and then the processed signal is converted back to the time domain using Inverse Fast Fourier Transform, thus completing the enhanced noise reduction. A sparse representation algorithm is used to separate the individual motion audio of each independently moving axis; the processing flow includes the following: (1) Configure the normalized amplitude spectrum of the mixed signal for simultaneous multi-axis operation; the amplitude spectrum is represented as follows: $X \approx \hat{X} = BW = \sum_{i \in A} w_i b_i$, where X is the amplitude spectrum of the sound signal from simultaneous multi-axis operation; $\hat{X}$ is the approximate amplitude spectrum of the mixed sound signal; B is the matrix consisting of the dictionaries of each axis, an overcomplete dictionary that includes all amplitude spectrum segments of the sound of each axis; W is the coefficient matrix; A is the active set; $b_i$ is an atom of B; and $w_i$ is the corresponding coefficient; (2) The extracted normalized amplitude spectrum is clustered using the k-means clustering algorithm to construct an overcomplete dictionary for each axis; the overcomplete dictionary is represented as follows: $B = (B_x\ B_y\ B_z\ B_u\ B_v\ B_w)^T$, where $B_x$ is the overcomplete dictionary of axis x, $B_y$ is the overcomplete dictionary of axis y, $B_z$ is the overcomplete dictionary of axis z, $B_u$ is the overcomplete dictionary of axis u, $B_v$ is the overcomplete dictionary of axis v, and $B_w$ is the overcomplete dictionary of axis w; (3) Solve for the sparse coefficients using the active set Newton algorithm; the specific processing is as follows: the sparse coefficient $w_n$ of each atom is initialized by $w_n = \frac{\mathbf{1}^T X}{\mathbf{1}^T b_n}$ and updated by $w_A \leftarrow w_A - \alpha H_{w_A}^{-1} \nabla_{w_A}$; where $\mathbf{1}$ is an all-ones vector of the same length as X; $w_A$ is the vector of representation coefficients of the atoms in the active set A; $\alpha$ is the step size; $H_{w_A}^{-1} \nabla_{w_A}$ is the search direction; $H_{w_A}$ is the Hessian matrix with respect to $w_A$; $\nabla_{w_A}$ is the gradient of the KL divergence with respect to the active atom weight vector; before the inverse of the Hessian matrix is calculated, the identity matrix multiplied by a small constant $\epsilon = 10^{-10}$ is added to it; and the active set A is initialized with $A = \arg\min_n KL(X \| w_n b_n)$ and then updated at each iteration, where $KL(X \| w_n b_n)$ represents the KL divergence between X and $w_n b_n$; through the above iterative algorithm, the active set A and the representation coefficients $w_A$ of the corresponding atoms in the active set A are obtained; (4) Using the inverse Fourier transform, the approximate amplitude spectrum $X_i$ of each axis, and the original phase, the individual motion audio of each axis is reconstructed; the approximate amplitude spectrum $\hat{X}$ of the mixed sound signal is expressed as follows: $\hat{X} = \sum_{i \in \{x,y,z,u,v,w\}} w_{iA} B_{iA}$; therefore, the approximate amplitude spectrum of each axis is expressed by the following formula: $X_i = w_{iA} B_{iA},\ i \in \{x, y, z, u, v, w\}$, where i is not only the index of the axis, but also the index of the row of the overcomplete dictionary B.

8. The side-channel detection system for robotic arm spoofing attacks based on acoustic features according to claim 6, characterized in that, The feature extraction process for the sound processing steps in the model building module is as follows: In the time domain, the extracted features include frame energy, zero-crossing rate, temporal envelope, and energy entropy; in the frequency domain, spectral entropy and power-normalized cepstral coefficients are extracted. For a given frame of length FL and an audio signal x(i), i = 1, 2, ..., FL, Frame energy is obtained by the following formula: The zero-crossing rate is obtained as follows: ; The signal is divided into frames, and the energy ratio of each component within each frame is calculated. These energy ratios are then used to calculate the energy entropy of the sound. The frame is divided into short sub-frames of length K; letting $E_j$ be the energy of the j-th short sub-frame, the energy entropy is: The temporal envelope of the sound is obtained in the following way: $T_e(i) = \sqrt{\mathrm{Re}(HT(i))^2 + \mathrm{Im}(HT(i))^2}$, where HT(i) is the analytic signal obtained by applying the Hilbert transform to the signal x(i), and the envelope $T_e(i)$ of the analytic signal is the square root of the sum of squares of its real part Re(HT(i)) and its imaginary part Im(HT(i)); $X_i(k)$, k = 1, ..., FL, represents the magnitude of the Fast Fourier Transform coefficients of a given frame; The spectrum is divided into N subbands; letting $E_f$ be the energy of the f-th subband, the spectral entropy is: Based on this, the time-domain features and frequency-domain features are concatenated to form a time-domain feature vector $S_t$: {frame energy, zero-crossing rate, energy entropy, temporal envelope} and a frequency-domain feature vector $S_f$: {spectral entropy, power-normalized cepstral coefficients}.

9. The side-channel detection system for robotic arm spoofing attacks based on acoustic features according to claim 6, characterized in that, The model building module also includes a trace alignment step: cross-correlation analysis is used to align the audio data with the motion data in terms of time traces; specifically as follows: First, cross-correlation analysis is used to analyze the relationship between each group of audio and motion data samples in the training dataset to obtain the locally optimal time offset for each group of samples. Then, the minimum and maximum time offsets for all samples are determined. Next, using the minimum and maximum time offsets as the boundaries of the search interval, a grid search method is used within this search interval to find the globally optimal time offset τ for the entire training dataset, and τ is used as the fixed offset for trace alignment; the cross-correlation coefficient and the optimal time offset are calculated as follows: $R(\tau) = \sum_t x(t)\, m(t+\tau)$ and $\tau_{optimal} = \arg\max_{\tau} R(\tau)$, where R(τ) represents the cross-correlation coefficient between the sound time series x and the motion time series m at time offset τ; x(t) and m(t+τ) are the values of the two series at times t and (t+τ), respectively; by varying τ, the cross-correlation coefficient at different time offsets can be calculated, and the value $\tau_{optimal}$ that maximizes R(τ) represents the optimal time offset between the two sequences.