Deep learning based network intrusion real-time detection method
By performing local manifold embedding and incremental retraining on deep learning networks, a sparse index table is generated and deployed to a hardware platform. This solves the problem of low computational resource utilization on hardware platforms in traditional methods and achieves low-latency real-time intrusion detection.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- XIAN UNIV OF TECH
- Filing Date
- 2026-04-08
- Publication Date
- 2026-06-19
AI Technical Summary
Traditional intrusion detection methods based on rule matching and shallow machine learning are ill-equipped to deal with new variant attacks and hidden threats in encrypted traffic. Furthermore, deep learning technology has low computational resource utilization when deployed on hardware platforms, making it impossible to achieve low-latency, non-blocking real-time intrusion detection.
An initial detection network produced by unstructured pruning and the uncompressed original detection network are obtained, and forward computation and local manifold embedding are performed on both to compute a manifold offset. Incremental retraining guided by the manifold offset adjusts the weight distribution towards a block-structured sparse pattern. A sparse index table is then generated from the resulting weight matrix and deployed to the hardware platform together with the network.
The method improves the utilization of computing units, resolves the hardware pipeline bubble problem, achieves low-latency real-time intrusion detection of high-speed network traffic, and improves detection throughput.
Figure CN122247722A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of network security technology, and more specifically to a real-time network intrusion detection method based on deep learning.
Background Technology
[0002] With the explosive growth of network traffic and the increasing sophistication of network attack methods, traditional intrusion detection methods based on rule matching and shallow machine learning are struggling to cope with new variant attacks and hidden threats in encrypted traffic. Deep learning technology, due to its powerful automatic feature extraction capabilities, is widely used in the field of network intrusion detection, capable of uncovering deep-seated attack patterns from raw traffic or basic statistical characteristics. However, applying deep learning technology to real-time network intrusion detection faces significant challenges.
[0003] Although the number of parameters is reduced after unstructured pruning of deep neural networks, the sparse weights exhibit an irregular discrete distribution. This leads to the hardware being unable to identify and skip zero-value computation units when deployed to target hardware platforms such as FPGAs, resulting in severely low utilization of computing resources. At the same time, the extremely unbalanced computational load of each layer after pruning causes a large number of idle waiting bubbles in the hardware pipeline, making the actual processing throughput far below the 100Gbps line rate requirement, and unable to achieve low-latency, non-blocking real-time intrusion detection in high-speed network environments.
Summary of the Invention
[0004] The purpose of this invention is to provide a real-time network intrusion detection method based on deep learning to solve the problems mentioned above.
[0005] The objective of this invention can be achieved through the following technical solution: a deep learning-based real-time network intrusion detection method comprising the following steps:
S1. Obtain an initial detection network produced by unstructured pruning and an uncompressed original detection network. Both networks are used for real-time network intrusion detection.
S2. Perform forward computation on the same network traffic sample with the initial detection network and the original detection network respectively, and extract the first intermediate feature map and the second intermediate feature map from multiple preset intermediate layers.
S3. Perform local manifold embedding on the first intermediate feature map and the second intermediate feature map respectively to obtain their embedding representations in the low-dimensional manifold space, and calculate the difference matrix between the embedding representations as the manifold offset.
S4. Construct a joint optimization objective based on the manifold offset and incrementally retrain the initial detection network. The weight distribution is adjusted through gradient backpropagation so that the weights converge towards a block-structured sparse pattern, yielding a structured sparse detection network.
S5. Perform structured sparse pattern parsing on the weight matrix of the structured sparse detection network to generate a corresponding sparse index table. The structured sparse detection network and the sparse index table are then deployed together on the target hardware platform to perform intrusion detection on real-time network traffic.
[0006] As a further aspect of the present invention: S2 specifically includes: The saliency of the feature maps output by each intermediate layer of the original detection network is evaluated, and regions with response values exceeding a preset threshold are selected as key regions based on the saliency maps. Based on the spatial location of the key region in the original detection network, corresponding spatial transformation parameters are generated; The spatial transformation parameters are applied to the feature map of the corresponding intermediate layer of the initial detection network, and the spatial position of the feature map is calibrated to obtain the first intermediate feature map after calibration. The feature map of the intermediate layer of the original detection network is directly used as the second intermediate feature map.
[0007] As a further aspect of the present invention: obtaining the calibrated first intermediate feature map specifically includes: The spatial transformation parameters are analyzed to obtain the translation and scaling components; Based on the magnitude of the scaling component, an adaptive filtering window is determined. The feature maps of the corresponding intermediate layers of the initial detection network are then weighted and averaged using the adaptive filtering window to obtain smooth feature maps. Based on the translation component, a sampling grid is generated on the smoothed feature map, and each grid point in the sampling grid corresponds to a feature position on the calibrated feature map; Based on the coordinates of each grid point in the sampling grid, feature values are extracted from the smooth feature map using neighborhood interpolation to generate the calibrated first intermediate feature map.
[0008] As a further aspect of the present invention: S3 specifically includes: A neighborhood graph is constructed on the first intermediate feature map, and the first similarity between each feature point and its neighboring feature points in the first intermediate feature map is calculated. A first adjacency matrix is generated based on the first similarity. A neighborhood graph is constructed for the second intermediate feature map, and the second similarity between each feature point and its neighboring feature points in the second intermediate feature map is calculated. A second adjacency matrix is generated based on the second similarity. Based on the first adjacency matrix and the second adjacency matrix, the low-dimensional embedding representations of the first intermediate feature map and the second intermediate feature map under the condition of preserving the local neighborhood structure are solved respectively to obtain the first embedding representation and the second embedding representation; Calculate the pointwise difference between the first embedding representation and the second embedding representation, and combine the pointwise differences into a difference matrix as the manifold offset.
[0009] As a further aspect of the present invention: S4 specifically includes: The manifold offset is normalized to obtain a weight mask; the weight mask is fused element-wise with the original gradient obtained by backpropagation of the initial detection network to generate the modulated gradient. The weights of the initial detection network are updated using the modulated gradient. During the update process, the weight matrix is scanned in blocks. When the average absolute value of the weights in a block exceeds a preset threshold, all weights in the block are retained; otherwise, all weights in the block are reset to zero. The update process is repeated until convergence, resulting in a structured sparse detection network with a block-like clustering pattern in the weight distribution.
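The block scan described in this aspect can be illustrated with a minimal sketch. The function name `block_prune`, the nested-list matrix layout, the block size of 2, and the threshold of 0.1 are all assumptions chosen for illustration; the source specifies only "a preset threshold".

```python
def block_prune(weights, block, threshold):
    """Zero every block x block tile whose mean |w| does not exceed threshold."""
    rows, cols = len(weights), len(weights[0])
    pruned = [row[:] for row in weights]  # leave the input untouched
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            cells = [(r, c)
                     for r in range(r0, min(r0 + block, rows))
                     for c in range(c0, min(c0 + block, cols))]
            mean_abs = sum(abs(weights[r][c]) for r, c in cells) / len(cells)
            if mean_abs <= threshold:  # "otherwise" branch: reset the whole block
                for r, c in cells:
                    pruned[r][c] = 0.0
    return pruned

# Hypothetical 4x4 weight matrix with two dense and two nearly empty blocks.
W = [[0.9, 0.8, 0.01, 0.0],
     [0.7, 0.0, 0.0, 0.02],
     [0.0, 0.03, 0.5, 0.6],
     [0.01, 0.0, 0.4, 0.0]]
pruned = block_prune(W, block=2, threshold=0.1)
```

In this example the top-left and bottom-right blocks survive intact, including their zero entries, while the two weak blocks are cleared entirely.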
[0010] As a further aspect of the present invention: the generation of the modulated gradient specifically includes: Gaussian smoothing of the weight mask in the spatial dimension yields a continuously distributed attention weight map. The attention weight map is multiplied element-wise with the original gradient to obtain the weighted gradient; to correct the orientation consistency of the weighted gradient, any component of the weighted gradient that is opposite in direction to the original gradient is set to zero, thus obtaining the corrected gradient. The corrected gradient is clipped to a maximum amplitude, limiting gradient values exceeding a preset amplitude threshold to the threshold value, thus generating the modulated gradient.
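A minimal sketch of the weighting, orientation check, and clipping steps follows. It assumes the attention weight map has already been smoothed and flattened to the same shape as the gradient; the function name `modulate_gradient` and the amplitude threshold of 1.5 are hypothetical.

```python
def modulate_gradient(grad, attention, max_abs):
    """Weight, sign-check, and clip a flat list of gradient values."""
    out = []
    for g, a in zip(grad, attention):
        weighted = a * g               # element-wise weighting
        if weighted * g < 0:           # opposite direction to the original gradient
            weighted = 0.0
        # Clip to the preset amplitude threshold (assumed symmetric here).
        out.append(max(-max_abs, min(max_abs, weighted)))
    return out

modulated = modulate_gradient([0.5, -2.0, 4.0], [1.0, 0.5, 1.0], max_abs=1.5)
```

With attention values in the range 0 to 1 the sign check never fires; it is kept in the sketch only to mirror the described sequence of operations.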
[0011] As a further aspect of the present invention: S5 specifically includes: The weight matrix is divided into multiple equal-sized weight blocks by performing a block scan, and the distribution density of all non-zero weight values in each weight block is calculated. Based on the distribution density, weight blocks with a density exceeding a preset density threshold are selected as valid calculation blocks, and the starting coordinates and block size of each valid calculation block in the weight matrix are recorded. Huffman coding is performed on the starting coordinates and block size to generate a variable-length coding table. The variable-length coding table and the weight values of the effective computation blocks are combined to form a sparse index table. The sparse index table is aligned and reorganized according to the memory access granularity of the target hardware platform to generate a contiguous binary instruction stream. The binary instruction stream and the weight values of the structured sparse detection network are then loaded into the target hardware platform.
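The density-based selection of valid calculation blocks might look like the sketch below. The names, the block size of 2, and the density threshold of 0.5 are assumptions; the source states only that a preset density threshold is used and that the starting coordinates and block size are recorded.

```python
def valid_blocks(weights, block, density_threshold):
    """Return (start_coordinates, block_size) for every sufficiently dense block."""
    rows, cols = len(weights), len(weights[0])
    found = []
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            tile = [weights[r][c]
                    for r in range(r0, min(r0 + block, rows))
                    for c in range(c0, min(c0 + block, cols))]
            density = sum(1 for w in tile if w != 0) / len(tile)
            if density > density_threshold:
                found.append(((r0, c0), block))
    return found

# Hypothetical block-structured sparse weight matrix.
W_sparse = [[0.9, 0.8, 0.0, 0.0],
            [0.7, 0.0, 0.0, 0.0],
            [0.0, 0.0, 0.5, 0.6],
            [0.0, 0.0, 0.4, 0.0]]
blocks = valid_blocks(W_sparse, block=2, density_threshold=0.5)
```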
[0012] As a further aspect of the present invention: the generation of the variable-length encoding table specifically includes: The frequency distribution of the starting coordinates of all valid computation blocks in the weight matrix is statistically analyzed, and the starting coordinates are assigned first codewords of different lengths according to their frequency. The frequency distribution of the block size of all valid computation blocks in the weight matrix is statistically analyzed, and second codewords of different lengths are assigned to the block size according to the frequency. The first and second codewords are concatenated bit by bit to form a mixed codeword, and the mixed codewords are arranged in the order of their appearance at the starting coordinates to form a codeword sequence; The codeword sequence is encapsulated with the code table file generated during the Huffman coding process. The code table file records the mapping relationship between each starting coordinate and each block size and its corresponding codeword. The encapsulated data is used as a variable-length encoding table.
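As an illustration of the two code tables and the concatenated mixed codewords, the sketch below builds a standard Huffman code with Python's `heapq`. The block list is made-up sample data, and the function name `huffman_code` is hypothetical; the encapsulation with a code table file is not shown.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Map each distinct symbol to a prefix-free bit string (frequent = shorter)."""
    freq = Counter(symbols)
    if len(freq) == 1:                      # degenerate case: a single symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (count, unique tie-breaker, partial code table).
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, t1 = heapq.heappop(heap)
        n2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in t1.items()}
        merged.update({s: "1" + code for s, code in t2.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

# Hypothetical valid blocks: (starting coordinates, block size).
blocks = [((0, 0), 4), ((0, 4), 4), ((4, 0), 8), ((8, 8), 4)]
coord_table = huffman_code([c for c, _ in blocks])   # first codewords
size_table = huffman_code([s for _, s in blocks])    # second codewords
# Mixed codewords, in order of appearance of the starting coordinates.
mixed = [coord_table[c] + size_table[s] for c, s in blocks]
```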
[0013] The beneficial effects of this invention are: (1) This invention transforms the scattered sparse weights generated by unstructured pruning into a block-structured sparse pattern through incremental retraining guided by manifold offset. This enables the target hardware platform to skip zero-value calculations using a sparse index table, effectively improving the utilization rate of computing units, solving the hardware pipeline bubble problem caused by unstructured sparsity, and increasing the throughput of real-time detection.
[0014] (2) The sparse index table generated by the present invention uses Huffman coding to compress the starting coordinates and block size of the effective computation block, reducing the storage overhead of the index data. At the same time, it is aligned and reorganized according to the hardware memory access granularity, reducing the access frequency of off-chip storage. Under the premise of ensuring detection accuracy, it realizes low-latency real-time intrusion detection of high-speed network traffic.
Attached Figure Description
[0015] The invention will now be further described with reference to the accompanying drawings.
[0016] Figure 1 is a flowchart of the method of the present invention.
Detailed Implementation
[0017] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0018] As shown in Figure 1, this invention is a real-time network intrusion detection method based on deep learning, comprising the following steps:
S1. Obtain an initial detection network produced by unstructured pruning and an uncompressed original detection network. Both networks are used for real-time network intrusion detection.
S2. Perform forward computation on the same network traffic sample with the initial detection network and the original detection network respectively, and extract the first intermediate feature map and the second intermediate feature map from multiple preset intermediate layers.
S3. Perform local manifold embedding on the first intermediate feature map and the second intermediate feature map respectively to obtain their embedding representations in the low-dimensional manifold space, and calculate the difference matrix between the embedding representations as the manifold offset.
S4. Construct a joint optimization objective based on the manifold offset and incrementally retrain the initial detection network. The weight distribution is adjusted through gradient backpropagation so that the weights converge towards a block-structured sparse pattern, yielding a structured sparse detection network.
S5. Perform structured sparse pattern parsing on the weight matrix of the structured sparse detection network to generate a corresponding sparse index table. The structured sparse detection network and the sparse index table are then deployed together on the target hardware platform to perform intrusion detection on real-time network traffic.
[0019] In S1, an initial detection network obtained through unstructured pruning and an uncompressed original detection network are acquired, both of which are used for real-time network intrusion detection. Specifically, the uncompressed original detection network is obtained first. The original detection network is a deep neural network for real-time network intrusion detection whose structure consists of multiple convolutional layers, pooling layers, and fully connected layers connected sequentially. It is constructed as follows. First, network traffic data is collected as a training sample set; each sample contains the packet header information and payload data of the network traffic and is labeled as either normal or attack. Second, the samples in the training sample set are sequentially input into the network structure to be trained, and the predicted category is calculated through forward propagation. Third, the loss value is calculated from the difference between the predicted category and the label using the cross-entropy loss function, which measures the degree of difference by calculating the relative entropy between the predicted probability distribution and the true label distribution. Fourth, using the stochastic gradient descent optimization algorithm, all weight parameters in the network are updated through backpropagation based on the loss value, with the learning rate set to 0.01 and the momentum factor set to 0.9. The second through fourth steps are repeated until the loss value has not decreased for 10 consecutive iterations, at which point the original detection network is obtained.
[0020] Next, unstructured pruning is performed on the original detection network to obtain the initial detection network, as follows. First, the original detection network is propagated forward once on a validation sample set, and the absolute value of every weight of every convolution kernel in each convolutional layer is recorded. Second, a pruning ratio of 50% is set for each convolutional layer: all weight values in the layer are sorted by absolute value in descending order, the top 50% are kept unchanged, and the remaining 50% are set to 0. Third, this pruning operation is performed layer by layer to obtain a sparse network whose weight matrices contain a large number of zero values. Fourth, the pruned sparse network is fine-tuned on the training sample set. Retraining uses the same forward propagation, loss calculation, and backpropagation steps as the original training, but only the weights that were not set to zero are updated during backpropagation; the zeroed weights remain unchanged throughout. Retraining runs for 20 rounds with the learning rate adjusted to 0.001, yielding the final unstructured-pruned initial detection network.
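The per-layer magnitude pruning and the masked update used during fine-tuning can be sketched as follows. The layer is represented as a flat list of weights, the helper names are hypothetical, and the update is plain gradient descent without momentum, which is a simplification of the training procedure described above.

```python
def magnitude_prune(weights, keep_ratio=0.5):
    """Keep the largest-magnitude weights of one layer; zero the rest."""
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    kept = set(order[:int(len(weights) * keep_ratio)])
    mask = [i in kept for i in range(len(weights))]
    pruned = [w if m else 0.0 for w, m in zip(weights, mask)]
    return pruned, mask

def masked_sgd_step(weights, grads, mask, lr=0.001):
    """Update only surviving weights so pruned positions stay at zero."""
    return [w - lr * g if m else 0.0 for w, g, m in zip(weights, grads, mask)]

layer = [0.4, -0.05, 0.9, 0.01, -0.7, 0.2]       # made-up weights
pruned, mask = magnitude_prune(layer)            # 50% pruning ratio
updated = masked_sgd_step(pruned, [1.0] * 6, mask)
```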
[0021] The initial detection network has the same number of layers and connection scheme as the original detection network. The difference lies in the weight matrix of the initial detection network, which exhibits an unstructured sparse distribution, meaning that zero-valued weights are randomly scattered throughout the matrix without a regular block clustering pattern. The original detection network serves as a reference benchmark in subsequent steps, providing a standard for feature distribution; the initial detection network, as the object to be optimized, is used in the subsequent manifold alignment retraining process.
[0022] In S2, the initial detection network and the original detection network perform forward computation on the same network traffic sample, respectively, and extract the first intermediate feature map and the second intermediate feature map from multiple preset intermediate layers, specifically including: The first step is to evaluate the saliency of the feature map output by each intermediate layer of the original detection network. Regions with response values exceeding a preset threshold are selected as key regions based on the saliency map. The specific calculation method for saliency evaluation is as follows: For the feature map output by the Lth layer of the original detection network, which has H rows, W columns, and C channels, firstly, the sum of squares of the response values at each spatial location on the C channels is calculated, resulting in a single-channel saliency map with dimensions H rows by W columns. Then, a saliency threshold is set, which is 50% of the maximum value of the sum of squares of response values at all spatial locations on the saliency map. Spatial locations in the saliency map whose sum of squared response values exceeds this threshold are marked as 1, and the remaining locations are marked as 0, resulting in a binary mask map. The continuous regions marked as 1 in this binary mask map are the key regions.
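The saliency evaluation in this step reduces to a per-location sum of squares followed by a threshold at 50% of the maximum. A minimal sketch, assuming the feature map is stored as nested lists indexed `[row][column][channel]` (the function name and the tiny sample map are illustrative only):

```python
def saliency_mask(feature_map, ratio=0.5):
    """Return the single-channel saliency map and its binary mask."""
    saliency = [[sum(v * v for v in cell) for cell in row] for row in feature_map]
    threshold = ratio * max(max(row) for row in saliency)
    mask = [[1 if s > threshold else 0 for s in row] for row in saliency]
    return saliency, mask

# Hypothetical feature map with H=2, W=3, C=2.
fmap = [[[1, 0], [3, 4], [0, 0]],
        [[2, 2], [3, 3], [1, 1]]]
saliency, mask = saliency_mask(fmap)
```

Grouping the positions marked 1 into connected regions is not shown here.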
[0023] The second step involves generating corresponding spatial transformation parameters based on the spatial location of the key regions within the original detection network. These parameters include translation and scaling components. The translation component is calculated by averaging the row and column coordinates of all spatial locations covered by the key region. This average is used as the center point coordinates of the key region, and the row and column offsets between these center point coordinates and the center point coordinates of the feature map are taken as the translation component. The scaling component is calculated by subtracting the minimum row coordinate from the maximum row coordinate to obtain the row span, and subtracting the minimum column coordinate from the maximum column coordinate to obtain the column span; the ratios of these row and column spans to the total number of rows and columns of the feature map, respectively, are used as the scaling component.
[0024] The third step involves analyzing the spatial transformation parameters to obtain the translation and scaling components. The translation component includes two values: row offset and column offset. The scaling component includes two values: row scaling ratio and column scaling ratio. These four values are stored as a parameter set for subsequent feature map calibration operations.
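The four-value parameter set can be derived from a binary mask as in the sketch below. Two assumptions are made that the source does not state: coordinates are zero-based, so the feature map center is taken as ((H-1)/2, (W-1)/2), and all positions marked 1 are treated as a single key region.

```python
def transform_params(mask):
    """Return (row_offset, col_offset, row_scale, col_scale) for one key region."""
    H, W = len(mask), len(mask[0])
    coords = [(i, j) for i in range(H) for j in range(W) if mask[i][j] == 1]
    rows = [i for i, _ in coords]
    cols = [j for _, j in coords]
    # Translation: key-region center minus feature-map center.
    row_offset = sum(rows) / len(rows) - (H - 1) / 2
    col_offset = sum(cols) / len(cols) - (W - 1) / 2
    # Scaling: span (max minus min) relative to the map size.
    row_scale = (max(rows) - min(rows)) / H
    col_scale = (max(cols) - min(cols)) / W
    return row_offset, col_offset, row_scale, col_scale

key_mask = [[0, 0, 0, 0],
            [0, 0, 1, 1],
            [0, 0, 1, 1],
            [0, 0, 1, 1]]
params = transform_params(key_mask)
```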
[0025] The fourth step involves determining an adaptive filtering window based on the scaling components. This window is then used to perform a weighted average of the feature maps of the intermediate layers of the initial detection network, resulting in a smoothed feature map. The size of the adaptive filtering window is determined as follows: the row scaling ratio is multiplied by 8 and rounded up to obtain an odd number as the row size of the filtering window; the column scaling ratio is also multiplied by 8 and rounded up to obtain an odd number as the column size of the filtering window. For example, if the row scaling ratio is 0.6, multiplying by 8 gives 4.8, and rounding up gives 5, then the row size of the filtering window is 5; if the column scaling ratio is 0.3, multiplying by 8 gives 2.4, and rounding up gives 3, then the column size of the filtering window is 3. Then, a two-dimensional Gaussian weighted kernel is constructed using this filtering window size. The weights at the center of the Gaussian weighted kernel are the largest, gradually decreasing towards the edges. The feature map of the intermediate layer of the initial detection network is subjected to a two-dimensional convolution operation with the Gaussian weighted kernel mentioned above. The convolution stride is set to 1, and zeros are padded at the edges of the feature map to ensure that the output size remains unchanged. The convolution result is the smooth feature map.
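The window sizing and Gaussian smoothing can be sketched for a single channel as follows. Bumping an even result to the next odd integer is one reading of "rounded up to obtain an odd number" and is an assumption, as is the kernel standard deviation `sigma=1.0`, which the source does not specify.

```python
import math

def odd_window(scale, factor=8):
    """Multiply by 8, round up, and force an odd size (assumed interpretation)."""
    size = math.ceil(scale * factor)
    return size + 1 if size % 2 == 0 else size

def gaussian_kernel(rows, cols, sigma=1.0):
    """Normalized 2-D Gaussian kernel, largest at the center."""
    cr, cc = (rows - 1) / 2, (cols - 1) / 2
    k = [[math.exp(-((r - cr) ** 2 + (c - cc) ** 2) / (2 * sigma ** 2))
          for c in range(cols)] for r in range(rows)]
    total = sum(map(sum, k))
    return [[v / total for v in row] for row in k]

def smooth(channel, kernel):
    """Stride-1 filtering with zero padding; output size equals input size."""
    H, W = len(channel), len(channel[0])
    kr, kc = len(kernel) // 2, len(kernel[0]) // 2
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            for a, krow in enumerate(kernel):
                for b, kv in enumerate(krow):
                    r, c = i + a - kr, j + b - kc
                    if 0 <= r < H and 0 <= c < W:  # outside the map counts as zero
                        out[i][j] += kv * channel[r][c]
    return out

kernel = gaussian_kernel(odd_window(0.6), odd_window(0.3))   # 5 x 3 window
smoothed = smooth([[1.0] * 4 for _ in range(4)], kernel)
```

Because the kernel is symmetric, the sliding weighted sum above gives the same result as a true convolution.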
[0026] The fifth step involves generating a sampling grid on the smoothed feature map based on the translation components. Each grid point in the sampling grid corresponds to a feature location on the calibrated feature map. The sampling grid is generated as follows: First, the target size of the calibrated feature map is determined. This target size is the same as the feature map size of the corresponding intermediate layer of the original detection network, i.e., it has H rows and W columns. Then, for each target location on the calibrated feature map, whose coordinates are at row i and column j, the coordinates of the corresponding sampling point on the smoothed feature map are calculated based on the row and column offsets in the translation components. The row coordinate of the sampling point equals i plus the row offset, and the column coordinate equals j plus the column offset. The coordinates of the sampling points corresponding to all target locations constitute a sampling grid with a size of H rows and W columns.
[0027] Step 6: Based on the coordinates of each grid point in the sampling grid, feature values are extracted from the smoothed feature map using neighborhood interpolation to generate the calibrated first intermediate feature map. Neighborhood interpolation is implemented using bilinear interpolation. The specific calculation process is as follows: For the coordinates of each sampling point in the sampling grid, its four adjacent integer coordinate points are taken, i.e., the four points formed by rounding down and up the row coordinates and rounding down and up the column coordinates. The horizontal and vertical distances between the sampling point and these four points are calculated respectively. Using these distances as weights, the feature values at the four points are weighted and summed to obtain the feature value of the sampling point. The feature values calculated for all sampling points are arranged in row and column order according to the target position to form the calibrated first intermediate feature map. This calibrated first intermediate feature map is spatially aligned with the feature map output by the corresponding intermediate layer of the original detection network.
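The sampling grid and bilinear interpolation of the fifth and sixth steps can be sketched for a single channel as follows. Treating neighbors that fall outside the feature map as zero is an assumption, since the source does not say how out-of-range sampling points are handled.

```python
import math

def bilinear_sample(channel, r, c):
    """Interpolate the value at fractional coordinates (r, c)."""
    H, W = len(channel), len(channel[0])
    r0, c0 = math.floor(r), math.floor(c)
    dr, dc = r - r0, c - c0

    def at(i, j):
        # Assumed boundary rule: positions outside the map contribute zero.
        return channel[i][j] if 0 <= i < H and 0 <= j < W else 0.0

    return ((1 - dr) * (1 - dc) * at(r0, c0)
            + (1 - dr) * dc * at(r0, c0 + 1)
            + dr * (1 - dc) * at(r0 + 1, c0)
            + dr * dc * at(r0 + 1, c0 + 1))

def calibrate(channel, row_offset, col_offset):
    """Sample at (i + row_offset, j + col_offset) for every target position."""
    H, W = len(channel), len(channel[0])
    return [[bilinear_sample(channel, i + row_offset, j + col_offset)
             for j in range(W)] for i in range(H)]

channel = [[0.0, 1.0],
           [2.0, 3.0]]
center_value = bilinear_sample(channel, 0.5, 0.5)
calibrated = calibrate(channel, 0.5, 0.0)
```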
[0028] Step 7: The feature map of the intermediate layer corresponding to the original detection network is directly used as the second intermediate feature map. The second intermediate feature map retains the feature distribution of the original detection network and is used as a reference in subsequent steps.
[0029] In S3, local manifold embedding is performed on the first and second intermediate feature maps respectively to obtain their embedding representations in the low-dimensional manifold space, and the difference matrix between the embedding representations is calculated as the manifold offset, specifically including: The first step involves constructing a neighborhood graph for the first intermediate feature map. The first similarity between each feature point in the first intermediate feature map and its neighboring feature points is calculated, and a first adjacency matrix is generated based on this first similarity. The first intermediate feature map has H rows, W columns, and C channels, and is considered as H multiplied by W feature points, each feature point being a vector with C elements. For each feature point, the Euclidean distance between it and other feature points in its surrounding neighborhood is calculated. The neighborhood is defined as a square region centered on the feature point with a side length of 5 pixels. The similarity is calculated based on the Euclidean distance, and the similarity value is determined using a Gaussian kernel function. Specifically, the similarity is equal to the negative exponent of the square of the Euclidean distance divided by a first temperature parameter, where the first temperature parameter is the median of the variance of the Euclidean distances among all feature points. The similarity values between each feature point and all feature points in its neighborhood are then filled into the corresponding positions in the matrix, resulting in a sparse matrix as the first adjacency matrix. This matrix has dimensions H multiplied by W rows and H multiplied by W columns.
[0030] The second step involves constructing a neighborhood graph for the second intermediate feature map. This involves calculating the second similarity between each feature point in the second intermediate feature map and its neighboring feature points, and generating a second adjacency matrix based on this second similarity. The second intermediate feature map also has H rows, W columns, and C channels, and is considered as H multiplied by W feature points, each feature point being a vector with C elements. For each feature point, the Euclidean distance between it and other feature points in its surrounding neighborhood is calculated. The neighborhood is defined as a square region centered on the feature point with sides of 5 pixels. The similarity is calculated based on the Euclidean distance, using a Gaussian kernel function. Specifically, the similarity is equal to the negative exponent of the square of the Euclidean distance divided by the second temperature parameter, where the second temperature parameter is the median of the variances of the Euclidean distances among all feature points. The similarity values between each feature point and all its neighboring feature points are then filled into the corresponding positions in the matrix, resulting in a sparse matrix that serves as the second adjacency matrix. This matrix has dimensions H multiplied by W rows and H multiplied by W columns.
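Both adjacency matrices follow the same recipe, which the sketch below implements once: each pair of feature points within the 5 x 5 neighborhood receives the similarity exp(-d²/τ), where d is the Euclidean distance between their channel vectors. The temperature τ is passed in as a plain argument rather than derived from the data, and a dense nested list stands in for the sparse matrix; both are simplifications.

```python
import math

def adjacency(feature_map, temperature, side=5):
    """Gaussian-kernel similarity between each point and its spatial neighbors."""
    H, W = len(feature_map), len(feature_map[0])
    half = side // 2
    n = H * W
    A = [[0.0] * n for _ in range(n)]
    for i in range(H):
        for j in range(W):
            for u in range(max(0, i - half), min(H, i + half + 1)):
                for v in range(max(0, j - half), min(W, j + half + 1)):
                    if (u, v) == (i, j):
                        continue  # no self-loop
                    d2 = sum((a - b) ** 2
                             for a, b in zip(feature_map[i][j], feature_map[u][v]))
                    A[i * W + j][u * W + v] = math.exp(-d2 / temperature)
    return A

# Hypothetical 1 x 2 feature map with 2 channels.
A = adjacency([[[0.0, 0.0], [1.0, 0.0]]], temperature=1.0)
```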
[0031] The third step involves solving for the low-dimensional embedding representation of the first intermediate feature map while preserving the local neighborhood structure, based on the first adjacency matrix, to obtain the first embedding representation. The solution process uses the Laplacian eigenmap method, with the following specific steps: First, calculate the first diagonal matrix based on the first adjacency matrix. Each element on the diagonal of the first diagonal matrix is equal to the sum of all non-zero elements in the corresponding row of the first adjacency matrix. Then, calculate the first Laplacian matrix, which is equal to the first diagonal matrix minus the first adjacency matrix. Next, solve the generalized eigenvalue problem, i.e., find the eigenvalues and eigenvectors that satisfy the condition that the product of the first Laplacian matrix and the eigenvector equals the product of the eigenvalues and the first diagonal matrix and the eigenvectors. Select the eigenvectors corresponding to the smallest eigenvalues from the 2nd to the d+1th eigenvalue, where d is set to 16. Arrange these eigenvectors column-wise to obtain a matrix of H x W rows and 16 columns. Each row of this matrix corresponds to the embedding coordinates of a feature point in a 16-dimensional low-dimensional space. Rearrange these matrix rows to form a feature map of H rows, W columns, and 16 channels, which is the first embedding representation.
[0032] The fourth step involves solving for the low-dimensional embedding representation of the second intermediate feature map while preserving the local neighborhood structure, based on the second adjacency matrix, to obtain the second embedding representation. The solution process also employs the Laplacian eigenmap method, with the following specific steps: First, calculate the second diagonal matrix based on the second adjacency matrix. Each element on the diagonal of the second diagonal matrix is equal to the sum of all non-zero elements in the corresponding row of the second adjacency matrix. Then, calculate the second Laplacian matrix, which is equal to the second diagonal matrix minus the second adjacency matrix. Next, solve the generalized eigenvalue problem, i.e., find the eigenvalues and eigenvectors that satisfy the condition that the second Laplacian matrix multiplied by the eigenvector equals the eigenvalue multiplied by the second diagonal matrix multiplied by the eigenvector. Select the eigenvectors corresponding to the smallest eigenvalues from the 2nd to the d+1th eigenvalue, where d is set to 16. Arrange these eigenvectors column-wise to obtain a matrix of H x W rows and 16 columns. Each row of this matrix corresponds to the embedding coordinates of a feature point in a 16-dimensional low-dimensional space. Rearrange these matrix rows to form a feature map of H rows, W columns, and 16 channels, which is the second embedding representation.
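The Laplacian eigenmap computation used for both embeddings can be written compactly. The symbols below are introduced here for illustration and do not appear in the source: A is an adjacency matrix, D its diagonal degree matrix, L the Laplacian, and Y the resulting embedding.

```latex
D_{ii} = \sum_{j} A_{ij}, \qquad
L = D - A, \qquad
L\,y_k = \lambda_k\, D\, y_k, \qquad
0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_{HW},

Y = \bigl[\, y_2 \;\; y_3 \;\; \cdots \;\; y_{d+1} \,\bigr] \in \mathbb{R}^{HW \times d},
\qquad d = 16 .
```

Every row of L sums to zero, so the constant vector is an eigenvector with eigenvalue 0 and carries no structural information; this is why the selection starts from the second smallest eigenvalue.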
[0033] The fifth step involves calculating the pointwise difference between the first and second embedding representations, and combining these pointwise differences into a difference matrix as the manifold offset. For each spatial location, i.e., row i and column j, the difference between the 16-dimensional vector of the first embedding representation and the 16-dimensional vector of the second embedding representation at that location is calculated. The difference is calculated by subtracting the vector value of the second embedding representation from the vector value of the first embedding representation, resulting in a 16-dimensional difference vector. These 16-dimensional difference vectors are then arranged in spatial order to obtain a difference matrix with H rows, W columns, and 16 channels. This difference matrix is the manifold offset, used to characterize the degree of local structural shift in the feature distribution between the initial detection network and the original detection network.
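The pointwise difference in the fifth step can be sketched as follows. This is a minimal illustration that assumes both embedding representations are stored as nested Python lists of shape H x W x d; the function name `manifold_offset` is hypothetical.

```python
def manifold_offset(first_emb, second_emb):
    """Return the H x W x d difference (first minus second) of two embeddings.

    Both inputs are assumed to be nested lists with identical shape.
    """
    return [
        [
            [a - b for a, b in zip(vec1, vec2)]  # d-dimensional difference vector
            for vec1, vec2 in zip(row1, row2)    # walk the W columns
        ]
        for row1, row2 in zip(first_emb, second_emb)  # walk the H rows
    ]
```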
[0034] In S4, a joint optimization objective is constructed based on the manifold offset. The initial detection network is incrementally retrained, and its weight distribution is adjusted through gradient backpropagation to promote the convergence of the weights towards a block-structured sparse pattern, resulting in a structured sparse detection network, which specifically includes: The first step is to normalize the manifold offset to obtain a weight mask. The manifold offset is a data cube with H rows, W columns, and 16 channels. First, the sum of squares of all 16 channels at each spatial location is calculated, resulting in a two-dimensional energy map with H rows and W columns. Then, all spatial locations on this two-dimensional energy map are traversed, and the maximum value is found. The energy value at each spatial location is divided by this maximum value, resulting in a normalized energy map with values between 0 and 1. This normalized energy map is used as the weight mask. Regions with larger values in the weight mask indicate a more significant feature shift between the initial detection network and the original detection network at that spatial location.
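The normalization in the first step of S4 can be sketched as below. The function name `weight_mask` is hypothetical, the offset is assumed to be a nested list of shape H x W x 16, and the handling of an all-zero offset (returning an all-zero mask) is an assumption, since the text does not cover that case.

```python
def weight_mask(offset):
    """Turn an H x W x d manifold offset into an H x W mask with values in [0, 1]."""
    # Per-location energy: sum of squares over all channels.
    energy = [[sum(v * v for v in vec) for vec in row] for row in offset]
    peak = max(max(row) for row in energy)
    if peak == 0.0:
        # Assumption: with no offset anywhere, no location is emphasized.
        return [[0.0 for _ in row] for row in energy]
    # Divide every energy value by the global maximum.
    return [[e / peak for e in row] for row in energy]
```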
[0035] The second step involves Gaussian smoothing the weight mask spatially to obtain a continuously distributed attention weight map. A two-dimensional Gaussian convolution kernel is constructed, with dimensions set to 7 rows by 7 columns and a standard deviation of 2 pixels. The weight mask obtained in the first step is used as input and convolved with the Gaussian kernel in a two-dimensional convolution operation. The convolution stride is set to 1, and zero-padding is applied to the edges to ensure that the output size matches the input size. The convolution result is a smoothed weight mask, where the originally discrete salient regions are expanded into a continuously distributed attention weight map. This attention weight map also has H rows and W columns, and the value at each spatial location represents the degree of attention that location should receive during gradient modulation.
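The smoothing in the second step can be sketched in plain Python as follows. The kernel size (7 x 7), standard deviation (2), stride (1) and zero-padding come from the text. Normalizing the kernel so that its entries sum to 1 is an assumption, and the function names are hypothetical.

```python
import math

def gaussian_kernel(size=7, sigma=2.0):
    """Build a size x size Gaussian kernel (normalized to sum to 1 by assumption)."""
    half = size // 2
    raw = [
        [math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
         for dx in range(-half, half + 1)]
        for dy in range(-half, half + 1)
    ]
    total = sum(sum(row) for row in raw)
    return [[v / total for v in row] for row in raw]

def smooth_mask(mask, size=7, sigma=2.0):
    """Convolve an H x W mask with the Gaussian kernel, stride 1, zero-padded edges."""
    kernel = gaussian_kernel(size, sigma)
    half = size // 2
    h, w = len(mask), len(mask[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for ki in range(size):
                for kj in range(size):
                    y, x = i + ki - half, j + kj - half
                    # Positions outside the mask contribute zero (zero-padding).
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[ki][kj] * mask[y][x]
            out[i][j] = acc
    return out
```

Because the kernel is symmetric, correlation and convolution give the same result here, so no kernel flip is needed.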
[0036] The third step involves element-wise multiplying the attention weight map with the original gradient obtained from the backpropagation of the initial detection network to obtain a weighted gradient. After each forward propagation, the initial detection network calculates the loss value using the loss function, and then uses the backpropagation algorithm to calculate the original gradient value corresponding to each weight parameter. For the feature map output by a given convolutional layer in the initial detection network, the corresponding gradient map has H rows, W columns, and $C_{out}$ channels, where $C_{out}$ denotes the number of output channels of the convolutional layer. The attention weight map obtained in the second step is replicated along the channel dimension until its channel count equals $C_{out}$, yielding an attention weight tensor with exactly the same size as the gradient map. This attention weight tensor is then multiplied element-wise with the gradient map at each spatial location and each channel to obtain the weighted gradient. In the weighted gradient, regions with significant feature offsets carry a larger gradient magnitude relative to the other regions.
[0037] The fourth step is to perform directional consistency correction on the weighted gradient to obtain the corrected gradient. The specific operation of directional consistency correction is as follows: For each weight parameter, compare the sign of the weighted gradient value at that position with the sign of the original gradient at that position. If the sign of the weighted gradient is the same as the original gradient, the weighted gradient value remains unchanged; if the sign of the weighted gradient is opposite to the original gradient, the weighted gradient value at that position is set to 0. This comparison and setting operation is performed on all weight parameters one by one to obtain the corrected gradient. This corrected gradient ensures that the direction of gradient update always remains consistent with the direction of descent of the original loss function, avoiding deviation in the optimization direction caused by attention enhancement.
[0038] The fifth step involves amplitude clipping of the corrected gradient to generate the modulated gradient. First, the absolute values of all non-zero gradient values in the corrected gradient are collected, and the 90th percentile of these absolute values is used as the amplitude threshold. Then, each gradient value in the corrected gradient is examined in turn. If the absolute value of the gradient value exceeds the amplitude threshold, the gradient value is replaced with the amplitude threshold multiplied by the sign of the gradient value; otherwise, the gradient value is retained. The resulting modulated gradient is used in the subsequent weight update steps.
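The third through fifth steps (attention weighting, directional consistency correction, amplitude clipping) can be sketched together as below. The function name `modulate_gradient` and the nested-list layout (gradient H x W x C, attention H x W) are assumptions. The text does not say how the 90th percentile is computed; this sketch assumes the nearest-rank definition.

```python
import math

def modulate_gradient(grad, attention, percentile=90):
    """Apply attention weighting, sign correction and percentile clipping."""
    # Third step: broadcast the H x W attention map over all channels and multiply.
    weighted = [
        [[attention[i][j] * g for g in grad[i][j]] for j in range(len(grad[i]))]
        for i in range(len(grad))
    ]
    # Fourth step: zero every entry whose sign opposes the original gradient.
    corrected = [
        [
            [wg if wg * g >= 0 else 0.0 for wg, g in zip(w_vec, g_vec)]
            for w_vec, g_vec in zip(w_row, g_row)
        ]
        for w_row, g_row in zip(weighted, grad)
    ]
    # Fifth step: clip magnitudes at the percentile of the non-zero |values|.
    mags = sorted(abs(v) for row in corrected for vec in row for v in vec if v != 0.0)
    if not mags:
        return corrected
    rank = max(1, (percentile * len(mags) + 99) // 100)  # nearest rank (assumption)
    threshold = mags[rank - 1]
    return [
        [[math.copysign(threshold, v) if abs(v) > threshold else v for v in vec]
         for vec in row]
        for row in corrected
    ]
```

Because the attention map is non-negative by construction, the sign check leaves the weighted gradient unchanged in this sketch; it is kept to mirror the fourth step.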
[0039] Step 6: Update the weights of the initial detection network using the modulated gradient. The stochastic gradient descent optimization algorithm is used for weight updates, with a learning rate of 0.001 and a momentum factor of 0.9. For each weight parameter, the update method is: the new value of the weight parameter equals its current value minus the learning rate multiplied by the value of the modulated gradient at that position. This update operation is performed layer by layer to complete one iteration of weight updates.
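The update in Step 6 can be sketched as below. The text specifies a momentum factor of 0.9, but its stated update formula contains no momentum term. This sketch therefore assumes the common velocity-accumulation form of SGD with momentum, which reduces to the stated formula when the momentum factor is 0. The function name and the flat-list layout are hypothetical.

```python
def sgd_momentum_step(weights, grads, velocity, lr=0.001, momentum=0.9):
    """One SGD-with-momentum step over flat lists of weights and modulated gradients.

    Returns the updated weights and the updated velocity buffer.
    """
    # Accumulate the modulated gradient into the velocity buffer.
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    # Move each weight against the accumulated direction.
    new_weights = [w - lr * v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity
```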
[0040] Step 7: After each weight update, the updated weight matrix is scanned in blocks. The weight matrix of each convolutional layer is divided into blocks of a fixed size of 16 rows by 16 columns. For each weight block, the absolute mean of all 16 by 16 weight values within that block is calculated. A retention threshold is set, which is 50% of the absolute mean of all weights in the current convolutional layer. If the absolute mean of a weight block exceeds the retention threshold, all weight values within that block are retained; if the absolute mean of a weight block does not exceed the retention threshold, all weight values within that block are set to zero. This block scanning and zeroing operation is performed layer by layer, causing the weight distribution to gradually exhibit a block-like clustering pattern.
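The block scan in Step 7 can be sketched as follows. The function name `block_prune` is hypothetical, and the weight matrix is assumed to be a nested list of R rows by C columns. Averaging boundary blocks over only the cells they actually contain is also an assumption, since this step does not say how partial blocks are treated.

```python
def block_prune(weights, block=16, ratio=0.5):
    """Zero every block whose mean |w| does not exceed ratio * layer mean |w|."""
    rows, cols = len(weights), len(weights[0])
    layer_mean = sum(abs(v) for row in weights for v in row) / (rows * cols)
    threshold = ratio * layer_mean  # retention threshold: 50% of the layer mean
    pruned = [list(row) for row in weights]
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            cells = [
                (r, c)
                for r in range(r0, min(r0 + block, rows))
                for c in range(c0, min(c0 + block, cols))
            ]
            block_mean = sum(abs(weights[r][c]) for r, c in cells) / len(cells)
            if block_mean <= threshold:
                for r, c in cells:
                    pruned[r][c] = 0.0  # drop the whole block
    return pruned
```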
[0041] Step 8: Repeat steps 1 through 7 above, that is, calculate the loss value after each forward propagation, obtain the original gradient through backpropagation, generate the modulated gradient, update the weights, and perform block scanning to zero, until the change in the loss value is less than 0.01 in 5 consecutive iterations, at which point convergence is determined. The detection network obtained after convergence is the structured sparse detection network, in which the non-zero weights in the weight matrix are distributed in a regular 16-row by 16-column block-like pattern.
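The stopping rule in Step 8 can be sketched as below. The text does not say exactly how the change in the loss value is measured. This sketch assumes it means the absolute difference between successive iterations, checked over the 5 most recent differences; the function name is hypothetical.

```python
def has_converged(losses, tol=0.01, patience=5):
    """Return True if the last `patience` loss changes are all below `tol`."""
    if len(losses) < patience + 1:
        return False  # not enough history to form `patience` differences
    recent = losses[-(patience + 1):]
    return all(abs(b - a) < tol for a, b in zip(recent, recent[1:]))
```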
[0042] In S5, the weight matrix of the structured sparse detection network is parsed for its structured sparse pattern to generate a corresponding sparse index table. The structured sparse detection network and the sparse index table are then deployed together on the target hardware platform for intrusion detection of real-time network traffic, specifically including: The first step is to perform a block scan on the weight matrix, dividing it into multiple weight blocks of equal size, and calculating the distribution density of all non-zero weight values within each block. Each convolutional layer of the structured sparse detection network corresponds to a weight matrix with R rows and C columns, where R represents the number of input channels multiplied by the kernel height, and C represents the number of output channels multiplied by the kernel width. This weight matrix is divided according to a fixed block size of 16 rows by 16 columns, resulting in several consecutive 16x16 weight blocks. If a block has fewer than 16 rows or columns at a boundary, it is padded with zeros to complete the block. For each weight block, the number of non-zero weight values within that block is counted and denoted as $N_{i,j}$, where the subscript $i$ is the index of the block along the row direction and $j$ is the index of the block along the column direction. The total number of weights within each weight block is $N_{total} = 256$ (i.e., 16 by 16). The distribution density $\rho_{i,j}$ of the weight block is defined as the ratio of the number of non-zero weights to the total number of weights, calculated as $\rho_{i,j} = N_{i,j} / N_{total}$. The density value lies between 0 and 1 and reflects the degree of aggregation of non-zero weights in the weight block.
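The density calculation in the first step of S5 can be sketched as follows. The function name `block_densities` and the dictionary keyed by block indices are assumptions. Zero-padding of boundary blocks is modeled by always dividing by the full block size, so missing cells count as zero weights.

```python
def block_densities(weights, block=16):
    """Map each (block_row, block_col) index to its non-zero density in [0, 1]."""
    rows, cols = len(weights), len(weights[0])
    n_total = block * block  # 256 for 16 x 16 blocks
    densities = {}
    for bi, r0 in enumerate(range(0, rows, block)):
        for bj, c0 in enumerate(range(0, cols, block)):
            nonzero = sum(
                1
                for r in range(r0, min(r0 + block, rows))
                for c in range(c0, min(c0 + block, cols))
                if weights[r][c] != 0
            )
            # Cells beyond the matrix edge are padding zeros, so only the
            # numerator depends on the real cells.
            densities[(bi, bj)] = nonzero / n_total
    return densities
```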
[0043] The second step involves selecting weight blocks with densities exceeding a preset density threshold as valid computation blocks based on the distribution density. The starting coordinates and size of each valid computation block in the weight matrix are recorded. The density threshold is set to 0.3, meaning that when the distribution density of a weight block is greater than 0.3, the non-zero weights within the block are considered dense enough to be worth retaining and computing in hardware. For all weight blocks that meet this condition, their starting coordinates in the weight matrix are recorded. The starting coordinates consist of the row and column indices of the first weight at the top left corner of the block. The block size is also recorded. Since all blocks are 16x16, the block size is fixed at 16 rows and 16 columns, but for the sake of generality in subsequent encoding, the block size is still recorded as a parameter. The starting coordinates and block sizes of all valid computation blocks are compiled into a list for subsequent encoding.
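The selection in the second step can be sketched as below. The input format (a dictionary from block indices to densities) and the output format (a list of `(row, col, height, width)` tuples) are assumptions chosen for illustration, and the function name is hypothetical.

```python
def select_valid_blocks(densities, block=16, threshold=0.3):
    """Return (start_row, start_col, height, width) for blocks denser than threshold."""
    valid = []
    # Sorting the (block_row, block_col) keys gives top-to-bottom, left-to-right order.
    for bi, bj in sorted(densities):
        if densities[(bi, bj)] > threshold:
            # Starting coordinates are those of the block's top-left weight.
            valid.append((bi * block, bj * block, block, block))
    return valid
```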
[0044] The third step involves Huffman coding of the starting coordinates and block size to generate a variable-length coding table. Huffman coding is a lossless compression coding method that assigns codewords of different lengths based on the frequency of symbol occurrence. First, the frequency distribution of the starting coordinates of all valid computation blocks in the weight matrix is computed. Assume there are L different possible values for the starting coordinates (e.g., row coordinates range from 0 to R-16, column coordinates range from 0 to C-16). For each specific starting coordinate $c_k$, the number of times it appears is recorded as $n_k$. If the total number of valid computation blocks is M, the frequency of occurrence of this starting coordinate is $f_k = n_k / M$. Based on these frequencies, the starting coordinate with the highest frequency is assigned the shortest binary codeword, and the starting coordinate with the lowest frequency is assigned the longest binary codeword. A Huffman tree is constructed to obtain the first codeword corresponding to each starting coordinate. Similarly, the frequency distribution of the block sizes of all valid computation blocks is computed. Block sizes may in general vary between layers or blocks, but because all valid computation blocks in this scheme use a fixed 16x16 size, there is only one possible block size, with a frequency of 1. The corresponding second codeword can be assigned a single bit "0" or "1", but for generality it is still processed according to the Huffman coding rules. In practical applications with multiple block sizes, the same processing is used. The first codeword corresponding to each starting coordinate and the second codeword corresponding to each block size are recorded in a code table file.
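The Huffman tree construction in the third step can be sketched with the standard library as follows. The function name `huffman_code` is hypothetical. The tie-breaking order among equal frequencies and the choice of "0" for a single-symbol alphabet are assumptions, since the text leaves both open.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Return a dict mapping each distinct symbol to a prefix-free bit string."""
    freq = Counter(symbols)
    if len(freq) == 1:
        # Single-symbol case, e.g. the fixed 16x16 block size: one bit suffices.
        return {next(iter(freq)): "0"}
    # Heap entries: (count, unique tie-breaker, partial code table for the subtree).
    heap = [(count, idx, {sym: ""})
            for idx, (sym, count) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)   # least frequent subtree
        c2, _, right = heapq.heappop(heap)  # second least frequent subtree
        merged = {s: "0" + code for s, code in left.items()}
        merged.update({s: "1" + code for s, code in right.items()})
        heapq.heappush(heap, (c1 + c2, next_id, merged))
        next_id += 1
    return heap[0][2]
```

For example, `huffman_code([(16, 16)])` handles the fixed block size case, and passing a list of `(row, col)` tuples produces the first codewords for the starting coordinates.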
[0045] The fourth step involves concatenating the first and second codewords bit by bit to form a mixed codeword. These mixed codewords are then arranged in the order they appear at their starting coordinates, forming a codeword sequence. Specifically, for each valid computation block, the first codeword corresponding to its starting coordinate is extracted, followed by the second codeword corresponding to its block size. These two codewords are then concatenated sequentially to obtain a binary mixed codeword. Finally, following the scanning order of the valid computation blocks in the weight matrix from top to bottom and from left to right, all the mixed codewords are arranged sequentially to form a codeword sequence. This codeword sequence contains the position and size information of all valid computation blocks.
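The concatenation in the fourth step can be sketched as below. The function name and the data layout are assumptions for illustration: blocks are `(row, col, height, width)` tuples, and the two code tables are dictionaries from starting coordinate or block size to a bit string.

```python
def build_codeword_sequence(valid_blocks, coord_codes, size_codes):
    """Concatenate first and second codewords for every block in scan order."""
    # Scan order: top to bottom, then left to right.
    ordered = sorted(valid_blocks, key=lambda b: (b[0], b[1]))
    return "".join(
        coord_codes[(r, c)] + size_codes[(h, w)]  # mixed codeword for one block
        for r, c, h, w in ordered
    )
```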
[0046] The fifth step involves encapsulating the codeword sequence with the code table file generated during Huffman coding to obtain a variable-length code table. The code table file records the mapping between each starting coordinate and its first codeword, as well as the mapping between each block size and its second codeword. The codeword sequence and code table file are then merged and stored to form a complete sparse index table. This sparse index table can be parsed by hardware to locate the position of a valid computation block and obtain its size.
[0047] The sixth step involves aligning and reorganizing the sparse index table according to the memory access granularity of the target hardware platform to generate a contiguous binary instruction stream. Off-chip memory access on the target hardware platform (such as an FPGA) is typically performed with a fixed bit width, for example, a 512-bit burst transfer. Therefore, all binary data in the sparse index table (including the codeword sequence and the code table file) is aligned in 512-bit units, with zero padding added if the final unit is shorter than 512 bits. The weight values of all valid computation blocks in the structured sparse detection network are organized according to the same memory access granularity. Each weight value is typically represented as an 8-bit integer or a 16-bit floating-point number, and all weight values are arranged contiguously and aligned in 512-bit units. These two parts of data are then merged to generate a complete binary instruction stream containing all the weight data and index information required for hardware inference.
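The 512-bit alignment in the sixth step can be sketched as follows. Bit strings are used here only for readability. The function names, the placement of the index data before the weight data, and the 8-bit integer weight representation are all assumptions; the text fixes only the alignment unit and the zero padding.

```python
def align_bits(bits, granularity=512):
    """Pad a bit string with trailing zeros up to a multiple of `granularity`."""
    remainder = len(bits) % granularity
    if remainder:
        bits += "0" * (granularity - remainder)
    return bits

def build_stream(index_bits, weight_bytes, granularity=512):
    """Merge the aligned index table and aligned 8-bit weights into one stream."""
    index_part = align_bits(index_bits, granularity)
    # Each weight is written as 8 bits; negative ints become two's complement.
    weight_bits = "".join(format(b & 0xFF, "08b") for b in weight_bytes)
    return index_part + align_bits(weight_bits, granularity)
```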
[0048] The seventh step involves loading the binary instruction stream into the off-chip storage of the target hardware platform and loading the network structure parameters of the structured sparse detection network (such as the number of layers, the number of input / output channels per layer, and the convolution kernel size) into the hardware control register. After the target hardware platform starts, it captures network traffic data in real time and performs forward computation layer by layer in a pipelined manner. During each layer's computation, the hardware reads the corresponding sparse index table from the off-chip storage, parses the starting coordinates and block size of the effective computation blocks, and then reads only the weight values corresponding to these effective blocks from the weight storage, performing convolution operations with the input feature map. By avoiding the reading and computation of zero-value weights, the hardware can complete inference with a throughput close to the theoretical peak, achieving intrusion detection of real-time network traffic.
[0049] The working principle of this invention is as follows: An initial detection network obtained through unstructured pruning and an uncompressed original detection network are acquired. Both are then used to perform forward computation on the same network traffic sample to extract a first intermediate feature map and a second intermediate feature map from a preset intermediate layer. Next, local manifold embedding is performed on the first and second intermediate feature maps to obtain a low-dimensional embedding representation, and the difference matrix is calculated as the manifold offset. Then, a joint optimization objective is constructed based on the manifold offset to incrementally retrain the initial detection network. Gradient backpropagation is used to adjust its weight distribution and promote the convergence of the weights towards a block-like structured sparse pattern, thus obtaining a structured sparse detection network. Finally, the weight matrix of the structured sparse detection network is parsed using a structured sparse pattern to generate a corresponding sparse index table, which is then deployed together with the structured sparse detection network to a target hardware platform for intrusion detection of real-time network traffic.
[0050] The foregoing has provided a detailed description of one embodiment of the present invention, but this description is merely a preferred embodiment and should not be construed as limiting the scope of the invention. All equivalent variations and modifications made within the scope of the claims of this invention should still fall within the patent coverage of this invention.
Claims
1. A real-time network intrusion detection method based on deep learning, characterized in that it includes the following steps: S1, obtain an initial detection network produced by unstructured pruning and an uncompressed original detection network, both of which are used for real-time network intrusion detection; S2, perform forward computation on the same network traffic sample by the initial detection network and the original detection network respectively, and extract the first intermediate feature map and the second intermediate feature map from multiple preset intermediate layers; S3, perform local manifold embedding on the first intermediate feature map and the second intermediate feature map respectively to obtain their embedding representations in the low-dimensional manifold space, and calculate the difference matrix between the embedding representations as the manifold offset; S4, construct a joint optimization objective based on the manifold offset, incrementally retrain the initial detection network, and adjust its weight distribution through gradient backpropagation to promote the convergence of the weights towards a block-structured sparse pattern, thus obtaining a structured sparse detection network; S5, perform structured sparse pattern parsing on the weight matrix of the structured sparse detection network to generate a corresponding sparse index table, and deploy the structured sparse detection network and the sparse index table together on the target hardware platform to perform intrusion detection on real-time network traffic.
2. The real-time network intrusion detection method based on deep learning according to claim 1, characterized in that, S2 specifically includes: The saliency of the feature maps output by each intermediate layer of the original detection network is evaluated, and regions with response values exceeding a preset threshold are selected as key regions based on the saliency maps. Based on the spatial location of the key region in the original detection network, corresponding spatial transformation parameters are generated; The spatial transformation parameters are applied to the feature map of the corresponding intermediate layer of the initial detection network, and the spatial position of the feature map is calibrated to obtain the first intermediate feature map after calibration. The feature map of the intermediate layer of the original detection network is directly used as the second intermediate feature map.
3. The real-time network intrusion detection method based on deep learning according to claim 2, characterized in that, The process of obtaining the calibrated first intermediate feature map specifically includes: The spatial transformation parameters are analyzed to obtain the translation and scaling components; Based on the magnitude of the scaling component, an adaptive filtering window is determined. The feature maps of the corresponding intermediate layers of the initial detection network are then weighted and averaged using the adaptive filtering window to obtain smooth feature maps. Based on the translation component, a sampling grid is generated on the smoothed feature map, and each grid point in the sampling grid corresponds to a feature position on the calibrated feature map; Based on the coordinates of each grid point in the sampling grid, feature values are extracted from the smooth feature map using neighborhood interpolation to generate the calibrated first intermediate feature map.
4. The real-time network intrusion detection method based on deep learning according to claim 1, characterized in that, S3 specifically includes: A neighborhood graph is constructed on the first intermediate feature map, and the first similarity between each feature point and its neighboring feature points in the first intermediate feature map is calculated. A first adjacency matrix is generated based on the first similarity. A neighborhood graph is constructed for the second intermediate feature map, and the second similarity between each feature point and its neighboring feature points in the second intermediate feature map is calculated. A second adjacency matrix is generated based on the second similarity. Based on the first adjacency matrix and the second adjacency matrix, the low-dimensional embedding representations of the first intermediate feature map and the second intermediate feature map under the condition of preserving the local neighborhood structure are solved respectively to obtain the first embedding representation and the second embedding representation; Calculate the pointwise difference between the first embedding representation and the second embedding representation, and combine the pointwise differences into a difference matrix as the manifold offset.
5. The real-time network intrusion detection method based on deep learning according to claim 1, characterized in that, S4 specifically includes: The manifold offset is normalized to obtain a weight mask; The weight mask is fused element-wise with the original gradient obtained by backpropagation of the initial detection network to generate the modulated gradient. The weights of the initial detection network are updated using the modulated gradient. During the update process, the weight matrix is scanned in blocks. When the average absolute value of the weights in the same block exceeds a preset threshold, all weights in the block are retained; otherwise, all weights in the block are reset to zero. The update process is repeated until convergence, resulting in a structured sparse detection network with a block-like clustering pattern in the weight distribution.
6. The real-time network intrusion detection method based on deep learning according to claim 5, characterized in that, The generation of the modulated gradient specifically includes: Gaussian smoothing of the weight mask in the spatial dimension yields a continuously distributed attention weight map. The attention weight map is multiplied element-wise with the original gradient to obtain the weighted gradient; Directional consistency correction is applied to the weighted gradient: each gradient component in the weighted gradient whose direction is opposite to that of the original gradient is set to zero, thus obtaining the corrected gradient. The corrected gradient is amplitude-clipped, limiting gradient values whose magnitude exceeds a preset amplitude threshold to the threshold value, thus generating the modulated gradient.
7. The real-time network intrusion detection method based on deep learning according to claim 1, characterized in that, S5 specifically includes: The weight matrix is divided into multiple equal-sized weight blocks by performing a block scan, and the distribution density of all non-zero weight values in each weight block is calculated. Based on the distribution density, weight blocks with a density exceeding a preset density threshold are selected as valid calculation blocks, and the starting coordinates and block size of each valid calculation block in the weight matrix are recorded. Huffman coding is performed on the starting coordinates and block size to generate a variable-length coding table. The variable-length coding table and the weight values of the effective computation blocks are combined to form a sparse index table. The sparse index table is aligned and reorganized according to the memory access granularity of the target hardware platform to generate a contiguous binary instruction stream. The binary instruction stream and the weight values of the structured sparse detection network are then loaded into the target hardware platform.
8. The real-time network intrusion detection method based on deep learning according to claim 7, characterized in that, The generation of the variable-length encoding table specifically includes: The frequency distribution of the starting coordinates of all valid computation blocks in the weight matrix is statistically analyzed, and the starting coordinates are assigned first codewords of different lengths according to their frequency. The frequency distribution of the block size of all valid computation blocks in the weight matrix is statistically analyzed, and second codewords of different lengths are assigned to the block size according to the frequency. The first and second codewords are concatenated bit by bit to form a mixed codeword, and the mixed codewords are arranged in the order of their appearance at the starting coordinates to form a codeword sequence; The codeword sequence is encapsulated with the code table file generated during the Huffman coding process. The code table file records the mapping relationship between each starting coordinate and each block size and its corresponding codeword. The encapsulated data is used as a variable-length encoding table.