UPF Function Acceleration Method Based on Programmable Hardware
By offloading the UPF to the DPU and adopting a hardware-software co-engineering architecture and a rule dependency resolution algorithm, the problems of excessive CPU resource consumption and low cache hit rate of the UPF in 6G edge networks are solved, achieving high throughput and low latency packet processing.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NANKAI UNIV
- Filing Date
- 2025-06-04
- Publication Date
- 2026-06-30
Figure CN120640318B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of 5G mobile communication networks and relates to in-network programming technology for smart network interface cards (smart NICs).
Background Technology
[0002] With the widespread development and adoption of fifth-generation mobile communication technology (5G), people's daily communication experience has been significantly improved. Compared with previous communication technologies, 5G not only provides higher network bandwidth and faster data transmission rates, but also offers lower latency and higher reliability. However, these technological advancements have been accompanied by an explosive growth in network data traffic. The emergence of new services such as high-definition video, augmented reality (AR), virtual reality (VR), connected vehicles, and the Industrial Internet of Things (IIoT) has posed unprecedented challenges to network bandwidth, latency, reliability, and service isolation. The traditional fourth-generation mobile communication network (4G) architecture is no longer able to adapt to such rapidly growing and diversified service demands.
[0003] In 4G networks, network elements that process data simultaneously undertake the dual tasks of user data forwarding and control plane management, resulting in a relatively complex network architecture. To address these bottlenecks and more effectively handle data traffic in 5G networks, the 3GPP (3rd Generation Partnership Project) organization officially released the 3GPP Release 15 standard in June 2018. This version is also the first official 5G network architecture specification.
[0004] 3GPP Release 15 clearly proposed the concept of Control and User Plane Separation (CUPS). The core idea of this architecture is to separate data forwarding and control functions in traditional network architectures into two independent network function entities to improve network flexibility, scalability, and maintainability. Against this backdrop, 3GPP formally defined a new network function entity—the User Plane Function (UPF)—to replace traditional 4G network elements.
[0005] UPF is specifically responsible for user data forwarding, routing, QoS (Quality of Service) policy enforcement, and billing data generation. Compared to traditional network nodes, UPF offers greater deployment flexibility, allowing for closer deployment to the user side and significantly reducing network latency. It also supports network virtualization and containerization, enabling more flexible responses to the needs of different services and application scenarios. The emergence of UPF greatly enhances the overall performance of 5G networks, meeting the demands of ultra-high traffic, ultra-low latency, and ultra-high reliability services, and providing strong support for digital transformation and innovation across various industries.
[0006] As 5G continues to evolve, there is an increasing emphasis on edge computing, especially in 6G network architecture, where edge computing, as a core component, is gradually becoming a key bearer mode of the mobile core network. This requires UPFs to have higher throughput, lower latency, and stronger dynamic scalability.
[0007] In cloud-native core network architectures, UPFs are typically deployed on x86-based server platforms, and their performance is optimized through software acceleration frameworks such as DPDK, VPP, and eBPF. However, in real-world deployments with high-density user access and dynamic traffic changes, UPFs still face significant performance bottlenecks, including excessive CPU resource consumption, frequent context switching, and low cache hit rates. Recent research has attempted to offload UPF functionality to hardware acceleration platforms such as P4 switches and smart NICs, aiming to further improve network performance while preserving flexibility. P4 switches offer significant throughput advantages, but their high cost and power consumption limit large-scale deployment in edge scenarios. FPGA solutions provide good flexibility and customization capabilities, but their high development complexity and cost limit their practical application. Furthermore, the maximum throughput of current mainstream P4 smart NICs is approximately 50Gbps, which is insufficient for the 100Gbps-and-above processing performance required of future 6G edge UPFs. In comparison, the DPU (Data Processing Unit) offers higher throughput, lower cost and power consumption, and superior programmability, making it an ideal hardware platform for 6G edge UPF deployments.
Summary of the Invention
[0008] The purpose of this invention is to enable the User Plane Function (UPF) to meet the requirements of high throughput, low latency, and large-scale user access in 6G edge networks. By offloading the UPF from the server to the Data Processing Unit (DPU), server CPU resources are freed up. The DPU can then quickly process and forward data packets in the data plane, reducing end-to-end latency and improving throughput.
[0009] To achieve the above objectives, this invention provides a method for accelerating UPF functionality based on programmable hardware. The specific technical solution of this invention is as follows:
[0010] A method for accelerating UPF functionality based on programmable hardware involves offloading the UPF function from the server to the DPU, sending the data stream to the DPU hardware interface, and having the DPU process the data packets. The method includes the following steps:
[0011] Before the data stream transmission begins, the various rules in the User Plane Function (UPF) packet detection rules (PDR), forwarding action rules (FAR), usage report rules (URR), and QoS enforcement rules (QER) are stored in a configuration file in JSON format. When the program starts, it parses the configuration file and loads all the rules into the ARM memory on the DPU.
[0012] A hardware-software collaborative traffic processing mechanism is introduced, which processes data streams hierarchically by combining a software path and a hardware path. A data packet of a new flow first enters the hardware processing module of the DPU. Because it does not match any hardware forwarding rule, the hardware processing module steers it to the software path via the Receive Side Scaling (RSS) mechanism, where the ARM-based processing system inside the DPU performs flow table rule matching on the packet. If the packet matches an existing rule, that rule is issued to the hardware so that subsequent traffic can be matched and processed rapidly. After the rule is issued, the ARM processor performs the encapsulation, decapsulation, and forwarding operations on the current data packet. For traffic whose rule has already been issued, the hardware path completes high-speed matching without software intervention and forwards each data packet after performing the encapsulation or decapsulation specified by the rule.
[0013] When subsequent data packets re-enter the DPU hardware, the hardware will route the data to the corresponding hardware processing pipeline based on the uplink or downlink traffic direction, and match it with the pre-issued PDR rules. For uplink data streams that match, the hardware will perform decapsulation operations according to the FAR rules associated with the PDR, restoring the data packets to the standard single-layer UDP format, and forwarding them through the specified exit port. For downlink data streams that have completed the PDR match, a GTP-U tunnel header is added to the beginning of the original data packet, and it is forwarded through the specified port.
[0014] Furthermore, the method of the present invention also includes: introducing a rule reading module into the processing system based on the ARM architecture inside the DPU. This module adopts a hash table storage format, using the IP address of each user equipment (UE) as the key of the hash table and the PDR rule data of the corresponding UE as the value, thereby realizing efficient indexing and management of rules.
[0015] Furthermore, the method described in this invention also includes: introducing a data packet parsing module into the ARM-based processing system inside the DPU. This module extracts key five-tuple field information, including source IP address, destination IP address, source port, destination port, and protocol type, as well as the specific structure information of the data packet, through structured parsing of the data packet header, to provide support for subsequent rule matching and processing.
[0016] Furthermore, the method of the present invention also includes: introducing a rule matching module into the processing system based on the ARM architecture inside the DPU. This module uses an efficient hash table lookup mechanism to perform precise matching and fuzzy matching on key fields such as source IP address, destination IP address, source port number and destination port number of the data packet, so as to quickly determine whether the data packet has a corresponding PDR rule entry, and finally determine the optimal matching rule corresponding to the data packet.
[0017] Furthermore, the method of the present invention also includes: introducing a traffic classification module into the ARM-based processing system inside the DPU, which efficiently classifies traffic using a Count-Min Sketch data structure. When a data packet arrives at the control plane, the module extracts its external IP address as a flow identifier and uses multiple sets of independent hash functions to map the identifier to a two-dimensional counting matrix, with the counter value at the corresponding position being incremented and updated. The system estimates the frequency of occurrence of the data flow in real time based on the minimum value of the counters mapped by each hash path. When the estimated value exceeds a set threshold, the flow is identified as a large flow, thereby triggering the corresponding hardware path offloading mechanism to achieve rapid processing and path optimization of large flows.
[0018] Furthermore, the method of the present invention also includes: introducing a rule processing module into the ARM architecture-based processing system inside the DPU, which dynamically determines whether a rule needs to be offloaded from the control plane to the data plane fast path based on the judgment result of the data flow scale by the traffic classification module, and executes a rule dependency removal algorithm before the rule is issued to generate isolated rules without rule dependencies and issue them to the hardware.
[0019] The rule dependency resolution algorithm first determines the dependency rule set by comparing the values of the bits where the optimal matching rule mask is 1 with the corresponding bits of other rules. Based on the dependency rule set, a materialization matrix is constructed to identify potential conflict bits. If a dependency rule has a mask of 1 at a bit where the optimal matching rule mask is 0, and the bit value is inconsistent with the actual bit value of the current data stream, then that bit is marked as a potential conflict bit. After the materialization matrix is constructed, the algorithm efficiently selects bits, traversing the materialization matrix row by row, skipping conflict rows already covered by the current mask. For uncovered conflict rows, the algorithm quickly extracts the conflict position corresponding to its least significant bit through bit operations and adds it to the candidate bit set, while updating the current mask to mark that the conflict row has been covered. This traversal is repeated until all necessary bits are selected. Finally, based on the selected materialization bits, an isolation rule is generated according to the optimal matching rule, setting the corresponding bit of the optimal matching rule mask to 1 and setting the corresponding bit of rule IP to a value consistent with the current data stream.
[0020] Furthermore, the method of the present invention also includes: introducing a packet forwarding module into the ARM architecture-based processing system inside the DPU. This module adopts the high-performance packet sending mechanism provided by the Data Plane Development Kit (DPDK) to achieve fast packet processing and forwarding. With the help of DPDK's user-space polling mode and zero-copy mechanism, the packet forwarding module can significantly reduce latency and system overhead during packet processing, effectively improving the data plane throughput performance of the UPF.
[0021] The present invention has the following advantages over the prior art:
[0022] This invention offloads the User Plane Function (UPF) of the 5G core network to the Data Processing Unit (DPU), proposing a hardware-software co-processing architecture and introducing a rule dependency resolution algorithm and a traffic identification mechanism. Compared with existing solutions, this hardware-software co-processing architecture can fully leverage hardware acceleration capabilities to speed up packet processing. The traffic identification mechanism distinguishes large flows from small flows and adopts a "hardware processing for hot flows, software processing for cold flows" traffic splitting strategy, which not only improves the efficiency of large-flow processing but also avoids frequently issuing rules for small flows, saving valuable hardware resources. The rule dependency resolution algorithm generates isolated, independent rules by stripping the dependencies between rules, effectively reducing the issuance of redundant rules while preserving the accuracy of rule matching, further improving hardware resource utilization. These mechanisms work together to significantly improve the data forwarding throughput of the UPF.
Attached Figure Description
[0023] Figure 1 is a flowchart illustrating the overall workflow of the present invention;
[0024] Figure 2 is a comparison chart of the average throughput of the present invention and the traditional method;
[0025] Figure 3 is a comparison graph showing the throughput of the present invention and the traditional method over time.
Detailed Implementation
[0026] To more clearly illustrate the technical solution of the present invention, a detailed description will be provided below in conjunction with the accompanying drawings and examples.
[0027] The present invention proposes a UPF function acceleration method based on programmable hardware, the method being as follows:
[0028] A hardware-software co-processing architecture based on the DPU is proposed, fully leveraging the synergistic advantages of the fast and slow paths within the DPU platform. In the slow path, relying on the ARM processor on the DPU, DPDK is introduced to implement user-space packet parsing and forwarding, significantly reducing the overhead of frequent switching between kernel and user modes. In the fast path, only successfully matched rules are distributed to the BlueField-2 hardware within the DPU, accelerating packet processing and effectively reducing forwarding latency while ensuring high throughput performance.
[0029] A fast rule dependency removal algorithm is proposed, which can compute dependency-free rules in a very short time. This effectively solves the problem of hardware resource waste caused by complex dependencies between UPF rules, minimizes the storage requirements of redundant rules in the hardware, and improves the utilization efficiency of hardware resources.
[0030] A sketch-based traffic classification mechanism is designed to achieve efficient and accurate classification of network traffic. Considering the significant locality of network traffic, a small number of high-traffic flows account for the majority of the traffic load, and these high-traffic flows typically match only a few rules. Therefore, rules matching high-traffic flows are offloaded to the hardware path for accelerated processing, while low-traffic rules are handled by the software path. This avoids the waste of hardware resources and damage to forwarding performance caused by frequent offloading of low-traffic rules, thereby improving system throughput and hardware resource utilization.
[0031] The steps of the method include:
[0032] 1) Before data transmission begins, various rules for the User Plane Function (UPF), including Packet Detection Rule (PDR), Forwarding Action Rule (FAR), Usage Reporting Rule (URR), and QoS Enforcement Rule (QER), are stored in a configuration file in JSON format. Upon program startup, this configuration file is parsed, and all rules are loaded into the ARM memory on the DPU for subsequent packet matching and processing.
[0033] 2) When a new data stream arrives at the system, the data packet first enters the hardware processing module of the DPU. Since no matching rule has yet been configured in the hardware, the data packet passes through a ten-stage processing pipeline and is then uploaded to the control plane through the Receive Side Scaling (RSS) mechanism. The control plane, built on the DPDK framework, continuously polls for data packets to be processed and performs rule table lookups to determine whether the data stream matches a loaded rule entry. For a successfully matched data stream, the system performs packet counting and traffic statistics and completes the necessary processing and forwarding operations, also through DPDK. When the cumulative number of data packets in a data stream exceeds a preset threshold, the data stream is classified as a large flow, and the control plane triggers the large-flow rule offloading mechanism. This mechanism selects the optimal matching rule based on the characteristics of the current data packet and analyzes the rules with which it has dependencies in order to generate the corresponding isolation rule. Finally, the isolation rule is issued to the hardware so that the data stream is forwarded on the fast path and subsequent data packets are no longer uploaded to the control plane, significantly improving overall processing efficiency and system throughput.
[0034] 3) When subsequent data packets re-enter the DPU hardware, the hardware identifies the packet direction (uplink or downlink) from the packet header information and feature fields and routes the packet to the corresponding hardware processing pipeline. In the hardware pipeline, the data packets are matched against the pre-issued PDR rules. For successfully matched uplink data streams, the hardware performs decapsulation according to the FAR rules associated with the PDR, removing the GTP-U (GPRS Tunnelling Protocol User Plane) tunnel header, restoring the data packet to the standard single-layer UDP format, and forwarding it through the designated egress port. For downlink data streams, after PDR matching is completed in the Ingress phase, the Egress phase performs encapsulation according to the FAR rules, adding a GTP-U tunnel header to the original data packet and then forwarding it to the device through the designated port. The entire process is completed within the DPU's hardware path without control plane intervention, which significantly reduces forwarding latency and improves system throughput.
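The encapsulation and decapsulation in step 3) can be sketched in software as follows. This is only an illustration of the mandatory 8-byte GTP-U header (version 1, message type G-PDU), not the hardware pipeline described above; the function names are hypothetical, and the outer IP/UDP headers (GTP-U normally runs over UDP port 2152) and optional header fields are deliberately left out.

```python
import struct

GTPU_FLAGS = 0x30  # version 1, protocol type GTP, no optional fields
GTPU_GPDU = 0xFF   # message type carrying a user packet (G-PDU)


def gtpu_encap(teid: int, inner_packet: bytes) -> bytes:
    """Downlink direction: prepend a GTP-U header to the original packet."""
    header = struct.pack("!BBHI", GTPU_FLAGS, GTPU_GPDU, len(inner_packet), teid)
    return header + inner_packet


def gtpu_decap(gtpu_payload: bytes):
    """Uplink direction: strip the GTP-U header and return (teid, inner packet)."""
    if len(gtpu_payload) < 8:
        raise ValueError("truncated GTP-U header")
    flags, msg_type, length, teid = struct.unpack_from("!BBHI", gtpu_payload)
    if flags >> 5 != 1 or msg_type != GTPU_GPDU:
        raise ValueError("not a GTPv1-U G-PDU")
    if flags & 0x07:
        # Sequence number, N-PDU number and extension headers are out of scope here.
        raise ValueError("optional GTP-U fields are not handled in this sketch")
    return teid, gtpu_payload[8:8 + length]
```

In this sketch the tunnel endpoint identifier (TEID) would come from the FAR associated with the matched PDR; how the hardware stores that association is not specified in the text.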
[0035] Example:
[0036] The network architecture of this invention uses one server equipped with a 100Gbps NVIDIA BlueField-2 DPU and a 100Gbps Mellanox ConnectX-6 network card as the UPF system under test, and another server equipped with a 100Gbps Mellanox ConnectX-5 network card as the traffic generator. The traffic generation server runs DPDK and generates mobile network traffic based on the Pkt-Gen tool. The two servers are directly connected via a 100Gbps high-speed fiber optic cable to ensure unrestricted network transmission in the test environment.
[0037] Figure 1 illustrates the overall architecture of this invention, which consists of a control plane (software implementation) and a data plane (programmable hardware implementation). Its data processing flow is divided into two categories: the fast path and the slow path. Data packets on the fast path are processed and forwarded directly in the data plane without going through the CPU, thus achieving higher forwarding efficiency. The slow path handles data packets that cannot match existing rules in the data plane and therefore require further processing by the CPU. On the fast path, when a data packet successfully matches a rule deployed in the data plane, processing and forwarding are completed directly in the hardware without the packet being sent to the control plane, greatly improving forwarding performance. On the slow path, data packets that do not match any hardware rule are forwarded to the control plane via the RSS mechanism. The control plane then performs parsing, rule lookup, traffic classification, and rule generation and offloading for the data packets, and finally transmits them through DPDK.
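The interaction between the two paths can be illustrated with a small software model. This is a sketch under simplifying assumptions, not the DPU implementation: the class and method names are hypothetical, flows are keyed by a single string, an exact per-flow counter stands in for the traffic classification module, the isolation-rule step is omitted, and dropping packets that match no software rule is an assumption that the text does not state.

```python
from collections import Counter


class CoProcessingUpf:
    def __init__(self, software_rules, threshold):
        self.software_rules = software_rules  # flow key -> action, held in ARM memory
        self.hardware_table = {}              # rules already offloaded to the fast path
        self.packet_counts = Counter()        # per-flow packet counts seen on the slow path
        self.threshold = threshold

    def handle(self, flow_key):
        """Return (path, action) for one packet of the given flow."""
        # Fast path: the hardware table already holds a rule for this flow.
        if flow_key in self.hardware_table:
            return "fast", self.hardware_table[flow_key]

        # Slow path: hardware miss, so the packet is handled in software.
        action = self.software_rules.get(flow_key)
        if action is None:
            return "slow", "drop"  # assumed behaviour for unmatched packets

        self.packet_counts[flow_key] += 1
        if self.packet_counts[flow_key] > self.threshold:
            # Large flow: offload its rule so later packets stay in hardware.
            self.hardware_table[flow_key] = action
        return "slow", action
```

With a threshold of 3, the first four packets of a matching flow take the slow path (the fourth one triggers the offload) and later packets take the fast path.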
[0038] The hardware-software co-processing architecture based on DPU proposed in this invention includes multiple modules in the control plane. These modules work together to achieve the functions, including a rule reading module, a packet parsing module, a rule matching module, a traffic classification module, a rule processing module, and a packet forwarding module.
[0039] The rule reading module is responsible for loading and managing various rules of the UPF, storing them in the ARM CPU memory using an efficient data structure. Specifically, this module uses a hash table storage format, with the IP address of each User Equipment (UE) as the key and the corresponding UE's PDR rule data as the value, to achieve efficient indexing and management of rules. When a data packet arrives, a hash table query is performed using the UE's IP address as the key, enabling fast rule location with O(1) time complexity, significantly reducing the latency of data packet rule matching.
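A minimal sketch of the rule reading module is shown below. The JSON layout and every field name (`pdrs`, `fars`, `ue_ip`, `precedence`, `far_id`, `action`) are hypothetical, since the text only states that the rules are stored in JSON format and indexed by UE IP address; URR and QER entries are omitted for brevity.

```python
import json
from collections import defaultdict

# Hypothetical configuration content; a real deployment would read it from a file.
CONFIG_TEXT = """
{
  "pdrs": [
    {"pdr_id": 1, "ue_ip": "10.60.0.1", "precedence": 32, "far_id": 1},
    {"pdr_id": 2, "ue_ip": "10.60.0.1", "precedence": 64, "far_id": 2},
    {"pdr_id": 3, "ue_ip": "10.60.0.2", "precedence": 32, "far_id": 1}
  ],
  "fars": [
    {"far_id": 1, "action": "forward"},
    {"far_id": 2, "action": "drop"}
  ]
}
"""


def load_rules(config_text):
    """Parse the configuration and build the in-memory lookup tables."""
    config = json.loads(config_text)

    # Hash table keyed by UE IP address; the value is that UE's list of PDRs.
    pdrs_by_ue = defaultdict(list)
    for pdr in config["pdrs"]:
        pdrs_by_ue[pdr["ue_ip"]].append(pdr)

    # FARs are looked up by the identifier that each PDR refers to.
    fars_by_id = {far["far_id"]: far for far in config["fars"]}
    return dict(pdrs_by_ue), fars_by_id


pdrs_by_ue, fars_by_id = load_rules(CONFIG_TEXT)
```

A Python dictionary gives average-case constant-time lookup by key, which corresponds to the O(1) rule location described above.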
[0040] The packet parsing module is responsible for analyzing and processing data packets uploaded from the data plane to the control plane. This module extracts key fields such as the five-tuple (source IP address, destination IP address, source port, destination port, and protocol type) and the specific structure of the data packet through structured parsing of the packet header. This provides support for subsequent rule matching and processing.
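As an illustration of the five-tuple extraction, the following sketch parses a raw IPv4 packet that carries TCP or UDP. It assumes the buffer starts at the IP header (no Ethernet or tunnel header); `FiveTuple` and `parse_five_tuple` are names chosen for this example only.

```python
import socket
import struct
from typing import NamedTuple


class FiveTuple(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int  # IP protocol number: 6 = TCP, 17 = UDP


def parse_five_tuple(packet: bytes) -> FiveTuple:
    """Extract the five-tuple from an IPv4 packet carrying TCP or UDP."""
    if len(packet) < 20:
        raise ValueError("truncated IPv4 header")
    if packet[0] >> 4 != 4:
        raise ValueError("not an IPv4 packet")

    header_len = (packet[0] & 0x0F) * 4  # IHL is counted in 32-bit words
    protocol = packet[9]
    if protocol not in (6, 17):
        raise ValueError("only TCP and UDP are handled in this sketch")
    if len(packet) < header_len + 4:
        raise ValueError("truncated transport header")

    src_ip = socket.inet_ntoa(packet[12:16])
    dst_ip = socket.inet_ntoa(packet[16:20])
    # TCP and UDP both start with the source port followed by the destination port.
    src_port, dst_port = struct.unpack_from("!HH", packet, header_len)
    return FiveTuple(src_ip, dst_ip, src_port, dst_port, protocol)
```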
[0041] The rule matching module is responsible for performing packet detection rule matching operations based on the key field information extracted by the packet parsing module. Specifically, this module uses an efficient hash table lookup mechanism to perform precise and fuzzy matching on key fields such as the source IP address, destination IP address, source port number, and destination port number of the packet to quickly determine whether the packet has a corresponding PDR rule entry, and finally determine the optimal matching rule for the packet.
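One possible reading of this matching step is sketched below. It treats "precise matching" as an exact comparison and "fuzzy matching" as prefix or wildcard matching, and it picks the candidate with the lowest precedence value as the optimal rule (in PFCP, a lower precedence value means higher priority). The `Pdr` fields and the helper names are assumptions made for this example.

```python
import ipaddress
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Pdr:
    pdr_id: int
    precedence: int          # lower value = higher priority
    dst_net: str             # CIDR prefix; "0.0.0.0/0" acts as a wildcard
    dst_port: Optional[int]  # None acts as a wildcard


def matches(pdr: Pdr, dst_ip: str, dst_port: int) -> bool:
    in_prefix = ipaddress.ip_address(dst_ip) in ipaddress.ip_network(pdr.dst_net)
    port_ok = pdr.dst_port is None or pdr.dst_port == dst_port
    return in_prefix and port_ok


def best_match(pdrs_by_ue: Dict[str, List[Pdr]], ue_ip: str,
               dst_ip: str, dst_port: int) -> Optional[Pdr]:
    """Hash lookup by UE IP, then choose the highest-priority matching PDR."""
    candidates = [p for p in pdrs_by_ue.get(ue_ip, []) if matches(p, dst_ip, dst_port)]
    return min(candidates, key=lambda p: p.precedence, default=None)
```

The hash lookup narrows the search to one UE's rules, so the linear scan only runs over that UE's (typically short) PDR list.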
[0042] The traffic classification module is responsible for real-time statistics and analysis of the flow to which each data packet belongs, and determines the flow type ("large flow" or "small flow") based on the measured flow size. To this end, this invention proposes a real-time traffic classification mechanism based on the Count-Min Sketch algorithm. The core strategy of this mechanism is to define data flows whose cumulative number of data packets exceeds the threshold as "large flows," and data flows whose cumulative number of data packets is below the threshold as "small flows." Specifically, when a data packet arrives at the control plane, the traffic classification module first extracts the external IP address of the data packet as a key, uses multiple independent hash functions to map this key to the two-dimensional counter matrix of the Count-Min Sketch, and updates the values of the corresponding counters. Subsequently, the module estimates the frequency of the data flow in real time as the minimum of the counters the key maps to. Once the estimated frequency of a data flow exceeds the predefined threshold, the system classifies it as a "large flow" in real time.
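The mechanism above can be sketched as a small Count-Min Sketch. The matrix dimensions, the threshold value, and the use of salted BLAKE2b as the family of hash functions are assumptions for illustration; the text does not specify them.

```python
import hashlib

LARGE_FLOW_THRESHOLD = 100  # hypothetical packet-count threshold


class CountMinSketch:
    def __init__(self, depth: int = 4, width: int = 1024):
        self.depth = depth  # number of hash functions (rows)
        self.width = width  # counters per row
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row: int, key: str) -> int:
        # One hash function per row, obtained by salting BLAKE2b with the row number.
        digest = hashlib.blake2b(key.encode(), digest_size=8,
                                 salt=row.to_bytes(8, "little")).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.rows[row][self._index(row, key)] += count

    def estimate(self, key: str) -> int:
        # Collisions can only inflate counters, so the minimum is the tightest estimate.
        return min(self.rows[row][self._index(row, key)] for row in range(self.depth))


def observe(sketch: CountMinSketch, flow_ip: str) -> bool:
    """Count one packet and report whether its flow is now a large flow."""
    sketch.add(flow_ip)
    return sketch.estimate(flow_ip) > LARGE_FLOW_THRESHOLD
```

Because the estimate never falls below the true count, a flow that really exceeds the threshold is always detected; the cost is that hash collisions may occasionally promote a small flow early.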
[0043] The main function of the rule processing module is to determine whether a rule needs to be offloaded from the control plane to the fast path of the data plane based on the traffic size assessment result of the traffic classification module, and further generate corresponding isolation rules to achieve efficient packet processing. When the traffic classification module determines that a data flow is a large flow and triggers a rule offloading operation, the rule processing module first analyzes whether other rules have potential dependencies on the optimal matching rule determined by the rule matching module. The criterion for identifying dependencies is: if the corresponding bit value of other rules is the same as that of the optimal matching rule at the bit where the mask value is 1, then these rules are determined to have a dependency relationship with the optimal matching rule.
[0044] After determining the set of dependent rules, the algorithm constructs a materialization matrix to identify which bit modifications can effectively resolve the dependencies between rules. The materialization matrix is constructed as follows: at each bit where the mask value of the optimal matching rule is 0, if the mask of a dependent rule at that position is 1 and the corresponding bit value is inconsistent with the actual bit value of the current data stream, then that bit position is marked as a potential conflict bit and set to 1 in the materialization matrix.
[0045] After constructing the materialization matrix, the algorithm selects suitable bits for materialization operations based on the matrix information. To achieve a reasonable balance between algorithm accuracy and execution efficiency, this invention, taking into account the hardware characteristics of the ARM architecture in the DPU, proposes an efficient materialization bit selection algorithm using the inline function `__builtin_clz` provided by the GCC compiler. The algorithm first traverses the materialization matrix row by row, skipping conflicting rows already covered by the current mask. For uncovered conflicting rows, the algorithm quickly extracts the conflicting positions corresponding to their least significant bits through bitwise operations and adds them to the candidate bit set, while simultaneously updating the current mask to mark that the conflicting row has been covered. After selecting all necessary bits, the algorithm performs a final traversal of the materialization matrix to verify whether the currently selected mask effectively covers all conflicting rows, and returns the final processing result accordingly.
[0046] In the final step, based on the specific bit positions selected above, the algorithm replaces the bit values at the corresponding positions of the optimal matching rule with the actual bit values of the current data stream, thereby generating the final isolation rule. The corresponding bits of the optimal matching rule mask are set to 1, and the corresponding bits of the rule IP are set to values consistent with the current data stream. The proposed algorithm, while slightly sacrificing matching accuracy, effectively reduces the computational latency in the rule generation process and significantly improves the overall system throughput, thus better meeting the stringent real-time and performance requirements of edge network scenarios.
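The dependency resolution procedure of paragraphs [0043] to [0046] can be sketched as follows. This is an interpretation rather than the patented implementation: rules are modelled as (value, mask) pairs over an 8-bit key for readability, the dependency test compares only bits that both rules match on, and the Python idiom `row & -row` isolates the lowest set bit in place of the GCC builtin named in the text. All identifiers are hypothetical.

```python
from typing import List, NamedTuple, Optional

WIDTH_MASK = 0xFF  # 8-bit keys for readability; an IPv4 key would use 0xFFFFFFFF


class Rule(NamedTuple):
    value: int  # bit pattern to match
    mask: int   # 1 = bit must match, 0 = wildcard


def find_dependents(best: Rule, others: List[Rule]) -> List[Rule]:
    # A rule is dependent if it agrees with `best` on every bit both rules match on.
    return [r for r in others
            if ((r.value ^ best.value) & best.mask & r.mask) == 0]


def conflict_rows(best: Rule, dependents: List[Rule], packet: int) -> List[int]:
    # One row per dependent rule: bits that `best` leaves as wildcards, that the
    # dependent rule fixes, and where its value differs from the current packet.
    return [~best.mask & r.mask & (r.value ^ packet) & WIDTH_MASK
            for r in dependents]


def select_bits(rows: List[int]) -> Optional[int]:
    selected = 0
    for row in rows:
        if row & selected:
            continue                # row already covered by an earlier choice
        if row == 0:
            return None             # no bit separates this rule from the packet
        selected |= row & -row      # pick the lowest conflict bit of this row
    # Final verification pass: every row must be covered by the selection.
    if any((row & selected) == 0 for row in rows):
        return None
    return selected


def isolate(best: Rule, others: List[Rule], packet: int) -> Optional[Rule]:
    selected = select_bits(conflict_rows(best, find_dependents(best, others), packet))
    if selected is None:
        return None
    # Materialize the selected wildcard bits with the packet's actual values.
    return Rule((best.value & best.mask) | (packet & selected), best.mask | selected)
```

For example, with `best = Rule(0b10100000, 0b11110000)`, a packet `0b10100110`, and dependent rules that fix the low bits differently, the sketch adds only as many mask bits as are needed for the resulting rule to stop overlapping each dependent rule while still matching the packet. Picking one bit per uncovered row is a greedy choice, which is consistent with the trade-off between accuracy and speed described above.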
[0047] The packet forwarding module is primarily responsible for efficiently forwarding packets that have successfully matched rules within the control plane. This module utilizes the high-performance packet sending mechanism provided by the Data Plane Development Kit (DPDK) to achieve rapid packet processing and forwarding. Leveraging DPDK's user-space polling mode and zero-copy mechanism, the packet forwarding module significantly reduces latency and system overhead during packet processing, effectively improving the data plane throughput performance of the UPF.
[0048] To demonstrate the effectiveness of this invention, it is compared with Free5GC, a pure software UPF system implemented on the x86 architecture, and UPF-ACCEL, a pure hardware UPF system implemented on the BlueField DPU. Figure 2 shows the average throughput of this invention, Free5GC, and UPF-ACCEL under different workload conditions. Experimental results show that this invention achieves higher throughput under all tested transmission loads, especially in high-speed transmission scenarios. Under a load of 100Gbps, this invention achieved near-line-rate throughput of 94Gbps, while Free5GC and UPF-ACCEL achieved only 40Gbps and 82Gbps, respectively. Furthermore, Figure 3 shows the throughput trend over time under a 100Gbps load. This invention ramps up faster, reaching peak throughput close to 100Gbps in a short period. In contrast, Free5GC and UPF-ACCEL ramp up more slowly and require more time to reach a stable performance state.
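The relative gains implied by the figures quoted for the 100Gbps load can be computed directly. The snippet below only restates the three numbers from the text; the function names are illustrative.

```python
# Average throughput reported in the text for a 100 Gbps offered load.
OFFERED_LOAD_GBPS = 100
THROUGHPUT_GBPS = {"this invention": 94, "UPF-ACCEL": 82, "Free5GC": 40}


def speedup(system: str, baseline: str) -> float:
    """Throughput of `system` relative to `baseline`."""
    return THROUGHPUT_GBPS[system] / THROUGHPUT_GBPS[baseline]


def line_rate_fraction(system: str) -> float:
    """Share of the offered load that the system actually forwards."""
    return THROUGHPUT_GBPS[system] / OFFERED_LOAD_GBPS
```

From these values, the invention forwards 94% of the offered load, 2.35 times the throughput of Free5GC and about 1.15 times that of UPF-ACCEL.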
[0049] Based on the above implementation methods, this invention offloads the User Plane Function (UPF), traditionally implemented on the CPU of a server, to the Data Processing Unit (DPU). By fully leveraging the architectural advantages of the DPU's hardware-software co-processing, it achieves a significant reduction in packet processing latency and an improvement in system throughput. By parsing and reconstructing dependent matching rules, it generates independent, dependency-free rules, reducing redundancy and conflicts in the hardware rule table. At the same time, a traffic classification mechanism divides network traffic into large and small flows and maps them to the hardware and software paths respectively, maximizing the performance advantages of hardware-software co-processing. This invention can significantly improve packet processing capabilities under high-load scenarios, effectively reduce latency bottlenecks, and improve overall system throughput and resource utilization.
[0050] It should be further noted that the above embodiments are only used for understanding the technical solution of the present invention and are not intended to limit the scope of protection of the present invention. Any obvious adjustments and improvements made to the technical solution of the present invention that are part of the technical concept of the present invention should also be within the scope of protection of the present invention.
Claims
1. A method for accelerating UPF functionality based on programmable hardware, characterized in that, This method offloads the UPF functionality from the server to the DPU, sending the data stream to the DPU hardware interface for packet processing. Specifically, it includes the following steps: Before the data stream transmission begins, the various rules in the User Plane Function (UPF) packet detection rules (PDR), forwarding action rules (FAR), usage report rules (URR), and QoS enforcement rules (QER) are stored in a configuration file in JSON format. When the program starts, it parses the configuration file and loads all the rules into the ARM memory on the DPU. A hardware-software collaborative traffic processing mechanism is introduced, which performs hierarchical processing of data streams by combining software and hardware paths. When new traffic arrives and does not match the hardware forwarding rules, the data packet first enters the hardware processing module of the DPU. The hardware processing module automatically guides the data packet to the software path via the RSS mechanism. That is, the processing system based on the ARM architecture inside the DPU performs flow table rule matching on the data packet. If an existing rule is matched, the existing rule is sent to the hardware to enable rapid matching and processing of subsequent traffic. After the rule is sent, the ARM processor performs encapsulation, decapsulation, and forwarding operations on the current data packet. For traffic for which the rule has been sent, the hardware path can directly complete high-speed matching without software intervention and forward the data packet after performing the corresponding encapsulation or decapsulation processing according to the rule instructions. 
When subsequent data packets enter the DPU hardware, the hardware routes them to the corresponding hardware processing pipeline according to the uplink or downlink traffic direction and matches them against the previously issued PDR rules. For uplink data streams that match, the hardware performs decapsulation according to the FAR rules associated with the PDR, restoring the data packets to the standard single-layer UDP format, and forwards them through the specified egress port. For downlink data streams that match a PDR, a GTP-U tunnel header is prepended to the original data packet, which is then forwarded through the specified port. The ARM-based processing system inside the DPU incorporates a rule processing module. Based on the data flow scale assessment made by the traffic classification module, this module dynamically determines whether a rule needs to be offloaded from the control plane to the data plane via a fast path. Before issuing the rule, it executes the rule dependency resolution algorithm to generate an isolated rule without dependencies and then issues that rule to the hardware. The rule dependency resolution algorithm first determines the dependency rule set. It then compares the values of the bits where the optimal matching rule mask is 1 with the corresponding bits of the other rules to determine whether they are the same. Based on the dependency rule set, a materialization matrix is constructed to identify potential conflict bits: for each bit where the optimal matching rule mask is 0, if the dependency rule mask is 1 at that position and the dependency rule's bit value is inconsistent with the actual bit value of the current data stream, that bit is marked as a potential conflict bit. After the materialization matrix is constructed, the algorithm selects bits by traversing the materialization matrix row by row, skipping conflict rows already covered by the current mask.
For each uncovered conflict row, the algorithm extracts the conflict position corresponding to the row's least significant set bit through bit operations and adds it to the candidate bit set. At the same time, it updates the current mask to mark the conflict row as covered, and it repeats the traversal until all necessary bits are selected. Finally, based on the selected materialization bits, an isolated rule is generated from the optimal matching rule: the corresponding bits of the optimal matching rule mask are set to 1, and the corresponding bits of the rule IP are set to values consistent with the current data stream.
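The rule dependency resolution steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Rule` class, the `isolate` function, the 8-bit keys, and the convention that a lower `precedence` value is matched first are all assumptions made for illustration. The sketch also assumes that a dependency rule is one that would be matched before the optimal rule and agrees with it on every bit both rules care about.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Rule:
    """Ternary rule: a key matches when it equals `value` on every bit set in `mask`."""
    value: int
    mask: int
    precedence: int  # assumption: a lower value is matched first

    def matches(self, key: int) -> bool:
        return (key ^ self.value) & self.mask == 0


def isolate(best: Rule, rules: Sequence[Rule], key: int) -> Rule:
    """Build an isolated rule for `key` that overlaps no dependency rule of `best`."""
    # Dependency set: rules that would win over `best` and agree with it
    # on all bits that both rules care about (assumed overlap test).
    deps = [r for r in rules
            if r.precedence < best.precedence
            and (r.value ^ best.value) & r.mask & best.mask == 0]

    # One row per dependency rule. A bit is a conflict bit when best's mask
    # is 0, the dependency mask is 1, and the rule's bit differs from the key.
    rows = [(r.value ^ key) & r.mask & ~best.mask for r in deps]

    selected = 0  # current mask of chosen bits
    for row in rows:
        if row == 0:
            raise ValueError("key matches a dependency rule; best is not optimal")
        if row & selected:
            continue  # row already covered by a chosen bit
        selected |= row & -row  # lowest set bit of the row

    new_mask = best.mask | selected
    # Selected bits take the key's values; the original bits are unchanged
    # because the key already matches `best`.
    return Rule(value=key & new_mask, mask=new_mask, precedence=best.precedence)


# Hypothetical 8-bit example.
best = Rule(value=0b1010_0000, mask=0b1111_0000, precedence=10)
dep1 = Rule(value=0b1010_0100, mask=0b1111_1100, precedence=5)
dep2 = Rule(value=0b1010_0011, mask=0b1111_0011, precedence=1)
key = 0b1010_1001
isolated = isolate(best, [best, dep1, dep2], key)
```

In this example the isolated rule adds bits 2 and 1 to the mask, so it still matches the current packet but can no longer overlap either dependency rule, which is what allows it to be issued to hardware on its own.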
2. The method according to claim 1, characterized in that the method further comprises: introducing a rule reading module into the ARM-based processing system inside the DPU. This module stores rules in a hash table, using the IP address of each user equipment (UE) as the key and that UE's PDR rule data as the value, thereby achieving efficient indexing and management of rules.
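A minimal Python sketch of such a UE-keyed rule table follows. The JSON layout and the field names (`pdrs`, `pdr_id`, `ue_ip`, `precedence`, `far_id`) are hypothetical, since the claims do not specify the configuration schema, and a Python `dict` stands in for the hash table.

```python
import json
from collections import defaultdict
from ipaddress import IPv4Address

# Hypothetical configuration content; the real schema is not given in the claims.
CONFIG = """
{"pdrs": [
  {"pdr_id": 1, "ue_ip": "10.60.0.1", "precedence": 32,  "far_id": 1},
  {"pdr_id": 2, "ue_ip": "10.60.0.1", "precedence": 255, "far_id": 2},
  {"pdr_id": 3, "ue_ip": "10.60.0.2", "precedence": 32,  "far_id": 3}
]}
"""


def build_pdr_table(config_text: str) -> dict:
    """Parse the JSON rules and index the PDRs by UE IP address."""
    table = defaultdict(list)
    for pdr in json.loads(config_text)["pdrs"]:
        table[IPv4Address(pdr["ue_ip"])].append(pdr)
    return dict(table)


pdr_table = build_pdr_table(CONFIG)
ue_rules = pdr_table.get(IPv4Address("10.60.0.1"), [])
```

Looking up a UE then costs one hash probe on average, after which only that UE's PDRs need to be examined.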
3. The method according to claim 1, characterized in that the method further comprises: introducing a packet parsing module into the ARM-based processing system inside the DPU. Through structured parsing of the packet header, this module extracts the five-tuple fields, namely source IP address, destination IP address, source port, destination port, and protocol type, as well as the specific structure information of the packet, to support subsequent rule matching and processing.
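Five-tuple extraction of this kind can be sketched in Python as follows. This is a simplified illustration, not the module described in the claim: it handles only a plain IPv4 header followed by TCP or UDP, does not parse a GTP-U outer header, and the names `FiveTuple` and `parse_five_tuple` are hypothetical.

```python
import socket
import struct
from typing import NamedTuple


class FiveTuple(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int


def parse_five_tuple(packet: bytes) -> FiveTuple:
    """Extract the five-tuple from an IPv4 packet carrying TCP or UDP."""
    if len(packet) < 20 or packet[0] >> 4 != 4:
        raise ValueError("not an IPv4 packet")
    header_len = (packet[0] & 0x0F) * 4  # IHL is counted in 32-bit words
    protocol = packet[9]
    if protocol not in (socket.IPPROTO_TCP, socket.IPPROTO_UDP):
        raise ValueError("unsupported transport protocol")
    if len(packet) < header_len + 4:
        raise ValueError("truncated transport header")
    # TCP and UDP both start with source port and destination port.
    src_port, dst_port = struct.unpack_from("!HH", packet, header_len)
    return FiveTuple(
        src_ip=socket.inet_ntoa(packet[12:16]),
        dst_ip=socket.inet_ntoa(packet[16:20]),
        src_port=src_port,
        dst_port=dst_port,
        protocol=protocol,
    )


# Hand-built test packet: 20-byte IPv4 header plus 8-byte UDP header.
ip_header = struct.pack(
    "!BBHHHBBH4s4s",
    0x45, 0, 28, 0, 0, 64, socket.IPPROTO_UDP, 0,
    socket.inet_aton("10.60.0.1"), socket.inet_aton("192.0.2.10"),
)
udp_header = struct.pack("!HHHH", 40000, 2152, 8, 0)
five_tuple = parse_five_tuple(ip_header + udp_header)
```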
4. The method according to claim 1, characterized in that the method further comprises: introducing a rule matching module into the ARM-based processing system inside the DPU. This module uses a hash table lookup mechanism to perform exact and fuzzy matching on the source IP address, destination IP address, source port number, and destination port number of the data packet, so as to quickly determine whether the data packet has a corresponding PDR rule entry and, finally, to determine the optimal matching rule for the data packet.
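One way to combine an exact hash lookup with fuzzy matching is sketched below in Python. The split used here is an assumption, not something stated in the claim: the UE IP is matched exactly through a `dict`, the remote IP is matched by prefix, the remote port may be a wildcard, and the candidate with the lowest `precedence` value is taken as the optimal rule. The `Pdr` class and `match_pdr` function are hypothetical names.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address, IPv4Network
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Pdr:
    pdr_id: int
    precedence: int             # assumption: lower value wins
    remote_net: IPv4Network     # fuzzy (prefix) match on the remote IP
    remote_port: Optional[int]  # None acts as a wildcard


def match_pdr(table: Dict[IPv4Address, List[Pdr]],
              ue_ip: IPv4Address,
              remote_ip: IPv4Address,
              remote_port: int) -> Optional[Pdr]:
    """Exact lookup on the UE IP, then fuzzy filtering of that UE's PDRs."""
    candidates = table.get(ue_ip, [])
    hits = [p for p in candidates
            if remote_ip in p.remote_net
            and p.remote_port in (None, remote_port)]
    return min(hits, key=lambda p: p.precedence, default=None)


ue = IPv4Address("10.60.0.1")
table = {
    ue: [
        Pdr(1, 255, IPv4Network("0.0.0.0/0"), None),    # catch-all rule
        Pdr(2, 32, IPv4Network("192.0.2.0/24"), 443),   # specific service
    ]
}
specific = match_pdr(table, ue, IPv4Address("192.0.2.10"), 443)
fallback = match_pdr(table, ue, IPv4Address("198.51.100.7"), 80)
unknown = match_pdr(table, IPv4Address("10.60.0.9"), IPv4Address("192.0.2.10"), 443)
```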
5. The method according to claim 1, characterized in that the method further comprises: introducing a traffic classification module into the ARM-based processing system inside the DPU, which classifies traffic using a Count-Min Sketch data structure. When a data packet arrives at the control plane, the module extracts its external IP address as the flow identifier and maps this identifier into a two-dimensional counting matrix using multiple independent hash functions, incrementing the counter at each mapped position. The system estimates the frequency of the data flow in real time as the minimum of the counters mapped by the hash functions. When the estimate exceeds a set threshold, the data flow is identified as a large flow, which triggers the hardware path offloading mechanism to achieve rapid processing and path optimization of large flows.
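The Count-Min Sketch classification can be sketched in Python as follows. The matrix dimensions, the threshold value, and the use of salted BLAKE2b as the family of hash functions are illustrative assumptions; the claim does not specify any of them.

```python
import hashlib
from ipaddress import IPv4Address

LARGE_FLOW_THRESHOLD = 100  # hypothetical threshold, in packets


class CountMinSketch:
    def __init__(self, depth: int = 4, width: int = 1024) -> None:
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row: int, key: bytes) -> int:
        # One hash function per row, derived by salting BLAKE2b with the row number.
        digest = hashlib.blake2b(
            key, digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key: bytes) -> int:
        """Count one packet for `key` and return the updated estimate."""
        for row in range(self.depth):
            self.table[row][self._index(row, key)] += 1
        return self.estimate(key)

    def estimate(self, key: bytes) -> int:
        # The minimum over rows never underestimates the true count.
        return min(self.table[row][self._index(row, key)]
                   for row in range(self.depth))


def is_large_flow(sketch: CountMinSketch, flow_ip: IPv4Address) -> bool:
    """Update the sketch for one packet and report whether the flow is large."""
    return sketch.add(flow_ip.packed) > LARGE_FLOW_THRESHOLD


sketch = CountMinSketch()
heavy = IPv4Address("10.60.0.1")
results = [is_large_flow(sketch, heavy) for _ in range(150)]
```

Because hash collisions can only inflate a counter, the sketch may classify a small flow as large but will not miss a flow that truly exceeds the threshold.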
6. The method according to claim 1, characterized in that the method further comprises: introducing a packet forwarding module into the ARM-based processing system inside the DPU. This module uses the high-performance packet transmission mechanism provided by the Data Plane Development Kit (DPDK) to achieve fast packet processing and forwarding. By relying on DPDK's user-space polling mode and zero-copy mechanism, the packet forwarding module reduces latency and system overhead during packet processing and improves the data plane throughput of the UPF.