A switching chip architecture for improving packet processing performance
By segmenting packets into headers and bodies and writing them in parallel to a shared buffer within the switching chip, the problem of low data bus bandwidth utilization is solved, improving the overall performance of the switching chip, especially the processing efficiency of short packets.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- XEL TECH INC
- Filing Date
- 2026-04-30
- Publication Date
- 2026-06-19
AI Technical Summary
In existing switching chips, once the data bus width is determined, if the message length is exactly the data bus width plus a small offset, the data bus processing efficiency will decrease significantly, forming a processing bottleneck and affecting overall performance.
The ingress scheduling module divides the message to be processed into a message header and a packet body, and writes the message header and packet body into the shared buffer module at the same time. The buffer address is located by the message header pointer and the packet body pointer. The message body is written to the shared buffer unit in parallel using a multiplexer. The egress scheduling module reads the packet body in sequence, the merging module performs merging processing, and finally the message is serially output to the target port.
Without increasing the system clock frequency and data processing bit width, the overall performance of the switching chip was improved, the performance bottleneck of short messages was solved, and more efficient data exchange was achieved.
Figure CN122247952A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of network communication technology, and in particular to a switching chip architecture that improves message processing performance.
Background Technology
[0002] In today's rapidly developing field of network communication, switching chips, as core components of network equipment, bear the heavy responsibility of efficient data exchange and processing. With the continuous increase in network bandwidth and the explosive growth of data traffic, the amount of data that switching chips need to process has increased dramatically, placing higher demands on data processing efficiency and flexibility. Switching chip architectures are typically designed with a separation of data and control paths to optimize data processing flows and improve overall performance. The control path is responsible for parsing packet headers, generating packet descriptors, and managing buffers, while the data path focuses on the actual transmission and scheduling of packets. The two work together to achieve efficient data exchange.
[0003] To address the ever-increasing demand for data processing, two main strategies are currently employed to enhance the data processing capabilities of switching chips. Firstly, increasing the system clock processing frequency increases the amount of data processed per unit time, thereby improving overall throughput. Secondly, increasing the bit width of the data processing bus allows each data transmission to carry more information, reducing the number of transmissions and thus improving data processing efficiency. These two approaches have alleviated data processing pressure to some extent and improved the performance of switching chips.
[0004] However, once the data bus width is determined, if the message length falls within the range of the data bus width plus a small offset (N bytes, where N is much smaller than the data bus width), the data bus processing efficiency will significantly decrease, creating a processing bottleneck. In this situation, the data bus cannot fully utilize its width advantage, resulting in wasted transmission cycles and consequently affecting the overall performance of the switching chip.
Summary of the Invention
[0005] This application provides a switching chip architecture to improve message processing performance, thereby solving the technical problem that the data bus in existing switching chips cannot fully utilize its bit width advantage, resulting in the waste of some transmission cycles and thus affecting the overall performance of the switching chip.
[0006] The first aspect of this application provides a switching chip architecture for improving message processing performance, applied to a switching chip in which the difference between the input/output bandwidth and the core processing bandwidth is within a preset difference range. The architecture includes an ingress scheduling module, a message parsing and processing module, a shared cache module, an egress scheduling module, and a merging module. The ingress scheduling module is configured to: receive the message to be processed and segment it into a message header and a packet body; send the message header to the message parsing and processing module, which parses and processes the message header to obtain the packet header; and write the packet header and the packet body into the shared cache module. The egress scheduling module is configured to: send the packet header pointer and the first packet body pointer of the message to be processed to the shared cache module; read the packet header and the first packet body and write them to the merging module; and, after the packet header and the first packet body have been read, read the remaining packet bodies sequentially using the packet body pointers. The merging module is configured to merge the packet header and the first packet body, and then serially output them to the target port along with the remaining packet bodies.
[0007] In some embodiments, the shared cache module includes several input ports and shared cache units. Each input port is configured to: read the packet body and configure its cache ID; record the cache ID of the first packet body, and configure the cache ID of the packet header based on the cache ID of the first packet body, so that the first packet body and the packet header are cached in different shared cache units; and, after the packet body and the packet header have been read, write the packet header and the packet body into the corresponding shared cache units through a multiplexer according to the cache IDs.
[0008] In some embodiments, the shared cache unit is configured with a packet header pointer and a packet body pointer, which are used to locate the cache addresses of the packet header and the packet body. The shared cache unit is also configured with a pointer FIFO, which records the packet body pointers of the remaining packet bodies in the order in which those packet bodies are written into the shared cache unit. The egress scheduling module is further configured to read the remaining packet bodies sequentially using the pointer FIFO.
[0009] In some embodiments, the packet header pointer and the first packet body pointer of the message to be processed are not recorded in the pointer FIFO. The shared cache unit is configured to: cache the first packet body and the packet header of the message to be processed according to their respective cache IDs, and send the corresponding packet header pointer and first packet body pointer to the egress scheduling module; and, after the packet header and the first packet body have been cached, cache the remaining packet bodies and record the corresponding packet body pointers in the pointer FIFO in the order in which the remaining packet bodies are written.
[0010] In some embodiments, the message parsing and processing module is further configured to parse and process the message header to obtain a message descriptor, which includes the message length, destination port, and priority of the message to be processed. The architecture also includes a cache management module configured to receive the message descriptor and send it to the egress scheduling module. The egress scheduling module is configured to determine the sending order and sending time of the messages to be processed based on the message descriptor.
[0011] A second aspect of this application provides a switching chip architecture for improving message processing performance, applied to a switching chip in which the difference between the input/output bandwidth and the core processing bandwidth is not within the preset difference range. The architecture includes an ingress scheduling module, a message parsing and processing module, a shared cache module, an egress scheduling module, and a merging module; the egress scheduling module includes a packet header scheduler and a packet body scheduler. The ingress scheduling module is configured to: receive the message to be processed and segment it into a message header and a packet body; send the message header to the message parsing and processing module, which parses and processes the message header to obtain the packet header; and write the packet header and the packet body into the shared cache module. The packet header scheduler is configured to send the packet header pointers for the current round and the next round to the shared cache module, and read the packet headers for the current round and the next round. The packet body scheduler is configured to: send the first packet body pointer of the message to be processed to the shared cache module; read the first packet body of the message to be processed, and write the packet header and the first packet body of the current round into the merging module; and, after the packet headers and first packet bodies of the current round and the next round have been read, read the remaining packet bodies sequentially using the packet body pointers. The merging module is configured to merge the packet header and the first packet body of the current round, and then serially output them to the target port along with the remaining packet bodies.
[0012] In some embodiments, the shared cache module includes several input ports and shared cache units. Each input port is configured to: read the packet body and configure its cache ID; record the cache ID of the first packet body, and configure the cache ID of the packet header based on the cache ID of the first packet body, so that the first packet body and the packet header are cached in different shared cache units; and, after the packet body and the packet header have been read, write the packet header and the packet body into the corresponding shared cache units through a multiplexer according to the cache IDs.
[0013] In some embodiments, a first preset number of first shared cache units in the shared cache module are dedicated to caching packet headers; a second preset number of second shared cache units are dedicated to caching packet bodies; and a third preset number of third shared cache units can cache either packet bodies or packet headers, but not both at once: while a third shared cache unit stores packet bodies it cannot store packet headers, and while it stores packet headers it cannot store packet bodies. The shared cache unit is configured with a packet header pointer and a packet body pointer, which are used to locate the cache addresses of the packet header and the packet body. The shared cache unit is also configured with a pointer FIFO, which records the packet body pointers of the remaining packet bodies in the order in which those packet bodies are written into the shared cache unit. The egress scheduling module is further configured to read the remaining packet bodies sequentially using the pointer FIFO.
[0014] In some embodiments, the packet header pointer and the first packet body pointer of the message to be processed are not recorded in the pointer FIFO. The shared cache unit is configured to: cache the first packet body and the packet header of the message to be processed according to their respective cache IDs, and send the corresponding packet header pointer and first packet body pointer to the egress scheduling module; and, after the packet header and the first packet body have been cached, cache the remaining packet bodies and record the corresponding packet body pointers in the pointer FIFO in the order in which the remaining packet bodies are written.
[0015] In some embodiments, the message parsing and processing module is further configured to parse and process the message header to obtain a message descriptor, which includes the message length, destination port, and priority of the message to be processed. The architecture also includes a cache management module configured to receive the message descriptor and send it to the egress scheduling module. The egress scheduling module is configured to determine the sending order and sending time of the messages to be processed based on the message descriptor.
[0016] This application provides a switching chip architecture for improving message processing performance, applied to a switching chip in which the difference between the input/output bandwidth and the core processing bandwidth is within a preset difference range. The architecture includes: an ingress scheduling module, a message parsing and processing module, a shared buffer module, an egress scheduling module, and a merging module. The ingress scheduling module is configured to: receive the message to be processed and segment it into a message header and a packet body; send the message header to the message parsing and processing module, which parses and processes it to obtain a packet header; and write the packet header and the packet body into the shared buffer module. The egress scheduling module is configured to: send the packet header pointer and the first packet body pointer of the message to be processed to the shared buffer module, read the packet header and the first packet body, and write them into the merging module; and, after the packet header and the first packet body have been read, use the packet body pointers to read the remaining packet bodies in sequence. The merging module is configured to merge the packet header and the first packet body, and serially output them with the remaining packet bodies to the target port. In this way, the overall performance of the switching chip is improved without increasing the system clock frequency or the data processing bit width.
Attached Figure Description
[0017] To more clearly illustrate the technical solution of this application, the drawings used in the embodiments will be briefly introduced below. Obviously, for those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0018] Figure 1 is a schematic diagram of the switching chip architecture for improving message processing performance in this application;
Figure 2 is a schematic diagram of the shared cache module in one embodiment of this application;
Figure 3 is a schematic diagram of the shared cache module in another embodiment of this application;
Figure 4 is a schematic diagram of the current switching chip architecture;
Figure 5 is the first line graph showing the relationship between line bandwidth and packet length under the current switching chip architecture;
Figure 6 is the second line graph showing the relationship between line bandwidth and packet length under the current switching chip architecture;
Figure 7 is the third line graph showing the relationship between line bandwidth and packet length under the current switching chip architecture;
Figure 8 is the fourth line graph showing the relationship between line bandwidth and packet length under the current switching chip architecture;
Figure 9 is the fifth line graph showing the relationship between line bandwidth and packet length under the current switching chip architecture;
Figure 10 is the sixth line graph showing the relationship between line bandwidth and packet length under the current switching chip architecture;
Figure 11 is a schematic diagram illustrating the elimination of performance troughs in the switching chip architecture of this application;
Figure 12 is a performance trend diagram of the switching chip architecture in this application after solving the trough and best-effort problems;
Figure 13 is a schematic diagram of the packet rate of a 133-byte message when the IO bandwidth is 480G in this application;
Figure 14 is a schematic diagram of the packet rate of a 64-byte message when the IO bandwidth is 480G in this application.
[0019] Explanation of reference numerals in the attached figures: 1-Ingress scheduling module; 2-Message parsing and processing module; 3-Shared buffer module; 31-Input port; 32-Shared buffer unit; 4-Egress scheduling module; 41-Packet header scheduler; 42-Packet body scheduler; 5-Merging module; 6-Buffer management module.
Detailed Implementation
[0020] To enable those skilled in the art to better understand the technical solutions in this application, the technical solutions in the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of this application.
[0021] In current network applications, switching chip architectures are typically divided into a data path and a control path. The control path generates message descriptors and other information from the message header parsing results and performs buffer management based on these descriptors, ultimately cooperating with the data path to complete message scheduling and output. The data path is responsible for input port scheduling, writing messages to the shared buffer at the data bus width, and reading messages from the shared buffer at data bus granularity according to the output port and priority. Once the data bus width is fixed, a bottleneck in the switching chip's data bus processing can occur when the message length equals the data bus width plus N bytes (where N is much smaller than the data bus width). An existing switching chip architecture is shown in Figure 4. Currently, the usual solutions are to increase the system clock frequency or to increase the data bus width.
[0022] The following explanation uses a system clock frequency of 500MHz and a data bus width of 1024bit as an example.
[0023] In scenarios where I / O bandwidth and internal core bandwidth are similar: Taking a 320Gbps Ethernet line bandwidth as an example, when the packet length (including 4B FCS) is in the range of 133-139 bytes, the switching performance is significantly insufficient, which will cause packets of this length to be dropped due to insufficient processing performance.
[0024] The calculation is as follows: 133 bytes - 4 bytes of FCS = 129 bytes of message. Processing a 129-byte message on a 1024-bit data bus requires two clock cycles (clk), but the second clock cycle carries only 1 byte of the message. At 500MHz, two cycles per message gives 250M messages per second, so the processing capacity of the switching chip is 250M × 1024 bit + 250M × 8 bit = 258Gbps. However, after accounting for the 20 bytes of interframe gap and preamble per packet, the effective input bandwidth of a 320Gbps line is 261.44Mpps × 129 bytes × 8 = 269.8Gbps, which exceeds the processing capacity.
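The arithmetic above can be reproduced with a short script. This is a minimal sketch: the function names `processing_bps` and `offered_bps` are illustrative and not taken from the application, while the constants come from the example (500MHz clock, 1024-bit bus, 320Gbps line, 4-byte FCS, 20 bytes of interframe gap and preamble).

```python
import math

CLOCK_HZ = 500_000_000       # system clock from the example
BUS_BITS = 1024              # data bus width from the example
LINE_BPS = 320_000_000_000   # Ethernet line bandwidth from the example
FCS_BYTES = 4                # stripped before the message crosses the bus
OVERHEAD_BYTES = 20          # interframe gap + preamble per frame on the wire


def processing_bps(frame_bytes: int) -> float:
    """Payload bits per second the core can move for frames of this length."""
    payload_bits = (frame_bytes - FCS_BYTES) * 8
    cycles = math.ceil(payload_bits / BUS_BITS)   # bus cycles per message
    messages_per_second = CLOCK_HZ / cycles
    return messages_per_second * payload_bits


def offered_bps(frame_bytes: int) -> float:
    """Payload bits per second arriving from the line for frames of this length."""
    frames_per_second = LINE_BPS / ((frame_bytes + OVERHEAD_BYTES) * 8)
    return frames_per_second * (frame_bytes - FCS_BYTES) * 8


print(round(processing_bps(133) / 1e9, 1), round(offered_bps(133) / 1e9, 1))
```

For 133-byte frames the offered load is larger than what the core can process, which is the shortfall the paragraph describes.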
[0025] As shown in Figure 5, the internal processing capability is clearly insufficient when the message length falls in the 133-139 byte inflection range, and the speedup ratio for messages near this length is extremely small, which also poses a great challenge to the internal design.
[0026] There are currently two solutions: 1. Increase the system clock frequency of the switching chip, for example to 550MHz in the above scenario. This accelerates internal processing and relieves the performance bottleneck for inflection-point messages. The performance after increasing the frequency to 550MHz is shown in Figure 6.
[0027] Given the selected process technology (e.g., SMIC 40nm), the above solution presents new challenges to the timing of internal logic, requiring more compact timing design, and also poses certain challenges to back-end PR (placement and routing).
[0028] 2. Increase the data bus width. With the system clock frequency held constant, widening the data bus improves internal processing performance and achieves line rate at all packet lengths, as shown in Figure 7. For example, increasing the bus width from 1024 bits to 1152 bits achieves line rate at all packet lengths.
[0029] The above solution is feasible when the difference between IO bandwidth and core bandwidth is not significant, and the bus width expansion is minimal. However, if the IO bandwidth is larger, a larger data bus width is required, which will pose a significant challenge to performance metrics.
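Both remedies for the 320Gbps case can be checked numerically by sweeping the frame length and comparing the line packet rate with the core packet rate. The sketch below is illustrative only: `bottleneck_lengths` is a hypothetical helper, and the sweep range of 64 to 1518 bytes is an assumption (standard Ethernet frame sizes), not a range stated in the application.

```python
FCS_BYTES = 4        # frame check sequence, excluded from the bus payload
OVERHEAD_BYTES = 20  # interframe gap + preamble per frame on the wire


def bottleneck_lengths(line_bps, clock_hz, bus_bits, lengths=range(64, 1519)):
    """Return frame lengths whose line packet rate exceeds the core packet rate."""
    slow = []
    for frame_bytes in lengths:
        payload_bits = (frame_bytes - FCS_BYTES) * 8
        cycles = -(-payload_bits // bus_bits)          # ceiling division
        wire_bits = (frame_bytes + OVERHEAD_BYTES) * 8
        # line_pps > core_pps  <=>  line_bps / wire_bits > clock_hz / cycles
        if line_bps * cycles > clock_hz * wire_bits:
            slow.append(frame_bytes)
    return slow


LINE_BPS = 320_000_000_000
baseline = bottleneck_lengths(LINE_BPS, 500_000_000, 1024)
faster_clock = bottleneck_lengths(LINE_BPS, 550_000_000, 1024)
wider_bus = bottleneck_lengths(LINE_BPS, 500_000_000, 1152)
print(baseline, faster_clock, wider_bus)
```

Under these assumptions the baseline reports exactly the 133-139 byte range, and both the 550MHz clock and the 1152-bit bus leave no bottleneck lengths in the swept range.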
[0030] In scenarios where the IO bandwidth is greater than the internal core bandwidth: Taking a 480Gbps Ethernet line bandwidth as an example, when the packet length (including 4B FCS) is in the range of 128 bytes + N bytes (N is less than 128 bytes), the switching performance is significantly insufficient, which will cause packets of this length to be dropped due to insufficient processing performance.
[0031] The calculation method is as follows: Taking 133 bytes as an example, 133 bytes - 4 bytes FCS = 129 bytes of message. Processing a 129-byte message with a 1024-bit data width requires 2 clock cycles (clk). However, the second clock cycle can only process 1 byte of message. Therefore, in this case, the processing capacity of the switching chip is: 250M×1024bit+250M×8bit=258Gbps. However, after removing the 20-byte interframe gap and preamble, the effective input bandwidth of a 480Gbps line bandwidth is 392.16Mpps×129 bytes×8=404.7Gbps.
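Rearranging the same arithmetic gives the minimum clock frequency that would keep up with a given frame length at a given line rate. This is a sketch under the example's assumptions (1024-bit bus, 4-byte FCS, 20 bytes of wire overhead); `min_clock_hz` is a hypothetical name and the result only covers the single frame length passed in.

```python
FCS_BYTES = 4
OVERHEAD_BYTES = 20
BUS_BITS = 1024


def min_clock_hz(line_bps: float, frame_bytes: int) -> float:
    """Clock frequency at which the core packet rate equals the line packet rate."""
    payload_bits = (frame_bytes - FCS_BYTES) * 8
    cycles = -(-payload_bits // BUS_BITS)                     # ceiling division
    line_pps = line_bps / ((frame_bytes + OVERHEAD_BYTES) * 8)
    return line_pps * cycles


print(round(min_clock_hz(480e9, 133) / 1e6, 1))  # MHz needed at 480Gbps
print(round(min_clock_hz(320e9, 133) / 1e6, 1))  # MHz needed at 320Gbps
```

For 133-byte frames this gives roughly 784MHz at 480Gbps and roughly 523MHz at 320Gbps, both above the 500MHz clock of the example and consistent with the 800MHz and 550MHz figures used in the text.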
[0032] As shown in Figure 8, when the message length falls in the inflection range of 128 bytes + N bytes (N is less than 128 bytes), the internal processing capability is clearly insufficient, and the speedup ratio for messages near this length is extremely small.
[0033] There are currently two solutions: 1. Increase the system clock frequency of the switching chip, for example to 800MHz in the above scenario. This accelerates internal processing and relieves the performance bottleneck for inflection-point packets. The performance after increasing the frequency to 800MHz is shown in Figure 9. However, this solution is difficult to implement with a 40nm process and requires a more advanced process, which would significantly increase the cost of the chip.
[0034] 2. Increase the data bus width. With the system clock frequency held constant, widening the data bus improves internal processing performance to achieve line rate at all packet lengths, as shown in Figure 10. However, because there are a large number of Cross-Bar structures between the ingress and egress scheduling and the shared cache banks, increasing the data bus to 1800 bits makes back-end PR extremely difficult, or even impossible to complete.
[0035] To address the technical problem that the data bus in the aforementioned switching chips cannot fully utilize its bit width, resulting in wasted transmission cycles and reduced overall performance, this application provides a switching chip architecture that improves message processing performance, described below. The conventional solution experiences a performance bottleneck in the 133-139 byte range because these message lengths require two clock cycles for data bus transmission, with particularly low data bus utilization in the second clock cycle. For messages between 140 and 260 bytes, there is no performance bottleneck because the second clock cycle has higher data bus utilization. Even for longer messages that require more than two clock cycles, the overall utilization is higher than in the 133-139 byte range because the first N-1 of the N clock cycles utilize the data bus 100%.
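The utilization argument can be made concrete by computing, for a given frame length, the share of bus capacity that carries payload over the cycles the frame occupies. This is a minimal sketch for the 1024-bit (128-byte) bus of the example; `bus_utilization` is an illustrative name, not one used in the application.

```python
BUS_BYTES = 128   # 1024-bit data bus from the example
FCS_BYTES = 4     # stripped before the message crosses the bus


def bus_utilization(frame_bytes: int) -> float:
    """Fraction of bus capacity carrying payload over the cycles a frame occupies."""
    payload = frame_bytes - FCS_BYTES
    cycles = -(-payload // BUS_BYTES)   # ceiling division
    return payload / (cycles * BUS_BYTES)


for frame_bytes in (133, 139, 140, 260, 261):
    print(frame_bytes, round(bus_utilization(frame_bytes), 3))
```

A 133-byte frame uses only about half of the two cycles it occupies, a 260-byte frame fills both cycles completely, and a 261-byte frame, although it spills into a third cycle, still has higher utilization than the 133-byte case.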
[0036] This application writes the packet header and the first slice of the packet body to the shared cache simultaneously, and likewise reads them out of the shared cache simultaneously, thus solving the short packet performance problem that exists in conventional solutions.
[0037] Figure 1 is a schematic diagram of the switching chip architecture for improving message processing performance in this application.
[0038] The first aspect of this application provides a switching chip architecture for improving message processing performance, applied to a switching chip in which the difference between the input/output bandwidth and the core processing bandwidth is within a preset difference range. The architecture includes: an ingress scheduling module 1, a message parsing and processing module 2, a shared buffer module 3, an egress scheduling module 4, and a merging module 5. The ingress scheduling module 1 is configured to receive the message to be processed and segment it into a message header and a packet body.
[0039] Specifically, when a message enters the switching chip, it is first received by the ingress scheduling module 1. The ingress scheduling module 1 performs preliminary preprocessing on the message, such as checking its integrity and checksum, to ensure that the message is not corrupted or erroneous. Next, the control path begins parsing the message header. The control path extracts key information such as the source MAC address, destination MAC address, and message type by reading the first few bytes of the message (the specific length depends on the message type; for example, Ethernet message headers are typically 14 bytes). This information is used to generate a message descriptor, which contains the message's metadata, such as message length, input port, output port, and priority, providing a basis for subsequent buffer management and scheduling output. After parsing the message header, the control path needs to determine the boundary between the message header and the packet body based on the message type and length information. For fixed-length message headers (such as Ethernet message headers), this process is relatively simple; it only requires reading the fixed-length bytes according to the message type. For message headers with variable lengths (such as some higher-level protocol messages), the control path needs to dynamically calculate the length of the message header based on the length field or other identification information in the message header, thereby determining the starting position of the packet body in order to segment the message to be processed into a message header and a packet body.
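The segmentation step can be sketched in software as follows. This is only an illustration under stated assumptions: the header is taken to be the fixed 14-byte Ethernet header mentioned in the text, the packet body is cut into slices of one bus width (128 bytes), and `segment` is a hypothetical function name. The actual boundary used by the chip depends on the message type, as the paragraph explains.

```python
ETH_HEADER_BYTES = 14   # fixed-length Ethernet header, as in the text
BUS_BYTES = 128         # assumption: body slices match the 1024-bit bus width


def segment(message: bytes, header_len: int = ETH_HEADER_BYTES):
    """Split a message into its header and a list of bus-width body slices."""
    if len(message) < header_len:
        raise ValueError("message shorter than its header")
    header = message[:header_len]
    body = message[header_len:]
    slices = [body[i:i + BUS_BYTES] for i in range(0, len(body), BUS_BYTES)]
    return header, slices


message = bytes(129)   # the 129-byte message from the worked example
header, slices = segment(message)
print(len(header), [len(s) for s in slices])
```

With these assumptions the 129-byte message, which needs two cycles on a single 1024-bit bus, becomes one header plus a single body slice.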
[0040] The message header is sent to the message parsing and processing module 2, which parses and processes it to obtain the packet header.
[0041] Specifically, the control path reads fixed-length (e.g., 14 bytes for an Ethernet header) or variable-length (dynamically determined based on the packet type field) byte data from the packet's start position, parsing out key information such as: source / destination address (e.g., MAC address, IP address), protocol type (e.g., TCP / UDP / ICMP), packet length (used to locate the packet's start position), and priority / VLAN tag (used for QoS scheduling). After parsing, the control path generates a packet descriptor containing the packet's metadata (e.g., input port, output port, priority, etc.), but the packet header still exists in its original binary form. The control path performs necessary processing on the parsed raw packet header: field modification (e.g., decrementing TTL, updating checksums), security verification (e.g., IPSec / MACsec encryption verification), and tag addition (e.g., VLAN tag, MPLS tag). The processed packet header is encapsulated into an internal data structure (i.e., the packet header), the format of which may vary depending on the chip design, but typically includes the original packet header fields and additional metadata (e.g., timestamp, scheduling policy identifier).
[0042] The message parsing and processing module 2 is also configured to: The message header is parsed and processed to obtain a message descriptor; the message descriptor includes: the message length, destination port, and priority of the message to be processed.
[0043] Specifically, of the data path and control path of the switching chip, the control path is responsible for parsing the packet header. This involves extracting and analyzing the information in the incoming packet header to identify key information such as the packet's destination address, source address, and protocol type. Based on the parsing results, the control path generates a packet descriptor (the message descriptor referred to above), which contains key information about the packet such as packet length, destination port, and priority. The generated packet descriptor is used for cache management: based on the information in it, the control path determines how to allocate cache space, when to write packets to the cache, and when to read packets from the cache. The cache management module also updates packet status information based on the packet descriptor, such as whether the packet has been fully received and whether it is ready to be sent.
[0044] The message descriptor serves the following purposes: Guided Cache Allocation: The information in the message descriptor guides the cache management module on how to allocate cache space for messages. For example, based on message length and priority, the cache management module can determine how much cache resources to allocate to a message to ensure that the message can be stored and processed effectively.
[0045] Support for message scheduling: The message descriptor is a crucial basis for the egress scheduling module to perform message scheduling. The egress scheduling module determines the sending order and time of messages based on the destination port and priority information in the message descriptor, so as to achieve fair and orderly transmission of messages.
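The role of the message descriptor in egress scheduling can be illustrated with a small model. The descriptor fields (length, destination port, priority) come from the text; everything else is an assumption made for illustration: the class names, the convention that a larger priority value is more urgent, and the strict-priority policy with first-come-first-served tie-breaking. The application does not specify the scheduling policy.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class MessageDescriptor:
    length: int      # message length in bytes
    dest_port: int   # destination (egress) port
    priority: int    # assumption: larger value means more urgent


class EgressScheduler:
    """Hypothetical strict-priority scheduler with one queue per destination port."""

    def __init__(self):
        self._queues = {}   # dest_port -> heap of (-priority, arrival, descriptor)
        self._arrival = 0   # arrival counter, keeps equal priorities in FIFO order

    def enqueue(self, desc: MessageDescriptor) -> None:
        heap = self._queues.setdefault(desc.dest_port, [])
        heapq.heappush(heap, (-desc.priority, self._arrival, desc))
        self._arrival += 1

    def dequeue(self, port: int):
        """Return the next descriptor to send on this port, or None if idle."""
        heap = self._queues.get(port)
        if not heap:
            return None
        return heapq.heappop(heap)[2]


sched = EgressScheduler()
sched.enqueue(MessageDescriptor(length=133, dest_port=1, priority=0))
sched.enqueue(MessageDescriptor(length=64, dest_port=1, priority=7))
sched.enqueue(MessageDescriptor(length=256, dest_port=1, priority=7))
order = [sched.dequeue(1).length for _ in range(3)]
print(order)
```

In this model the two priority-7 messages leave port 1 in arrival order before the priority-0 message.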
[0046] The packet header and the packet body are written into the shared cache module 3.
[0047] The egress scheduling module 4 is configured to: send the packet header pointer and the first packet body pointer of the message to be processed to the shared cache module 3; read the packet header and the first packet body and write them to the merging module 5; and, after the packet header and the first packet body have been read, read the remaining packet bodies sequentially using the packet body pointers.
[0048] The merging module 5 is configured to merge the packet header and the first packet body, and then serially output them to the target port along with the remaining packet bodies.
[0049] The architecture also includes a cache management module 6, which is configured to receive the message descriptor and send it to the egress scheduling module 4. The egress scheduling module 4 is configured to determine the sending order and sending time of the messages to be processed based on the message descriptor.
[0050] This application provides a switching chip architecture to improve message processing performance. After ingress scheduling, the process is divided into two paths: a data path and a control path. The control path, after parsing and processing the message header, sends the header to a shared buffer instead of discarding it as in current solutions. The data path only writes the packet body (excluding the header) to the shared buffer. For the shared buffer, at the start of a message, the logic simultaneously writes the header and the first packet body into the shared buffer in parallel. The egress scheduling simultaneously sends the header pointer and the first packet body pointer to the shared buffer to obtain its address, outputs two data streams in parallel, and then merges the header and the first packet body before finally serially outputting them to the target port.
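The split-and-merge flow described above can be sketched in a few lines of Python. This is only an illustration: the header length (`HEADER_BYTES`) and the cell width (`CELL_BYTES`) are assumed values that the application does not state, and the function names `segment` and `merge` are hypothetical.

```python
from typing import List, Tuple

HEADER_BYTES = 64   # assumed header length; not specified in the application
CELL_BYTES = 128    # assumed data-bus / cell width; not specified in the application


def segment(message: bytes) -> Tuple[bytes, List[bytes]]:
    """Ingress side: split a message into a header and a list of body cells."""
    header = message[:HEADER_BYTES]
    body = message[HEADER_BYTES:]
    cells = [body[i:i + CELL_BYTES] for i in range(0, len(body), CELL_BYTES)]
    return header, cells


def merge(header: bytes, cells: List[bytes]) -> bytes:
    """Egress side: merge the header with the first body cell, then append the rest."""
    if not cells:
        return header
    first = header + cells[0]          # header and first body are read in parallel
    return first + b"".join(cells[1:])  # remaining bodies follow serially
```

Under these assumed sizes, a 512-byte message yields one header and four body cells, and `merge(*segment(message))` reproduces the original message.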
[0051] The challenges in the above solution are how the shared cache can ensure that the two data buses are written to two different Bank caches in parallel, and how the two write operations can avoid the two read operations (considering area and power consumption, the Bank caches all use single-port SPRAM). The following will introduce the above challenges in detail.
[0052] When the IO bandwidth and the core processing bandwidth are similar, packets arriving on all ports can be scheduled in, switched internally, and scheduled out at line rate. The solution is as follows. Figure 2 is a structural schematic of the shared cache module in one embodiment of this application.
[0053] In this embodiment, the shared cache module 3 includes several input ports 31 and shared cache units 32.
[0054] The input port 31 is configured as follows: Read the packet body and configure a cache ID; record the cache ID of the first packet body, and configure the cache ID of the packet header based on the cache ID of the first packet body, so that the first packet body and the packet header are cached in different shared cache units 32; after the packet body and the packet header are read, write the packet header and the packet body into the corresponding shared cache unit 32 based on the cache ID using a multiplexer.
[0055] The shared cache unit 32 is configured with a packet header pointer and a packet body pointer; the packet header pointer and the packet body pointer are used to locate the cache addresses of the packet header and the packet body.
[0056] The shared cache unit 32 is also configured with a pointer FIFO; the pointer FIFO is used to record the corresponding packet body pointers according to the writing order of the remaining packet bodies when the remaining packet bodies are written into the shared cache unit 32.
[0057] The egress scheduling module 4 is further configured to read the packet bodies sequentially using the pointer FIFO.
[0058] In this embodiment, the packet header pointer and the first packet body pointer of the message to be processed are not recorded in the pointer FIFO.
[0059] The shared cache unit 32 is configured as follows: The first packet body and the packet header are cached according to the cache ID of the first packet body of the message to be processed and the cache ID of the packet header, and the corresponding packet header pointer and first packet body pointer of the message to be processed are sent to the egress scheduling module 4.
[0060] After the packet header and the first packet body of the message to be processed are cached, the remaining packet bodies are cached, and the corresponding packet body pointers are recorded in the pointer FIFO according to the writing order of the remaining packet bodies.
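The pointer bookkeeping described above can be illustrated as follows. This is a minimal sketch under assumptions: a pointer is modeled as a `(cache ID, cell address)` tuple, and the class name `MessagePointers` and its methods are hypothetical. The packet header pointer and the first packet body pointer are kept outside the pointer FIFO, and only the remaining packet body pointers are queued in write order.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Optional, Tuple

Pointer = Tuple[int, int]  # assumed encoding: (cache ID, cell address)


@dataclass
class MessagePointers:
    header_ptr: Optional[Pointer] = None
    first_body_ptr: Optional[Pointer] = None
    pointer_fifo: Deque[Pointer] = field(default_factory=deque)

    def record_write(self, kind: str, ptr: Pointer) -> None:
        """Record where a header or body cell was written."""
        if kind == "header":
            self.header_ptr = ptr
        elif kind == "body" and self.first_body_ptr is None:
            self.first_body_ptr = ptr       # first body: not placed in the FIFO
        elif kind == "body":
            self.pointer_fifo.append(ptr)   # remaining bodies: FIFO, in write order
        else:
            raise ValueError(f"unknown kind: {kind}")

    def read_order(self) -> List[Pointer]:
        """Egress read order: header and first body first, then the FIFO contents."""
        order = [p for p in (self.header_ptr, self.first_body_ptr) if p is not None]
        order.extend(self.pointer_fifo)
        return order
```

Because the packet body is written before the parsed header arrives, `record_write` accepts the two kinds in any order.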
[0061] Specifically, considering area and power consumption, all shared cache units 32 are implemented using SPRAM. Read operations have the highest priority in the shared cache units 32, so write operations to the shared cache units 32 must use "avoidance logic" to steer clear of read operations.
[0062] For the write side, in terms of timing, the packet body is first written to shared buffer unit 32, and the packet header is written to shared buffer unit 32 after "packet parsing and processing". Since the read operation side needs to read both the packet header and the first packet body simultaneously, the write side needs to write the packet header and the first packet body to different shared buffer units 32 to avoid conflicts on the read side. After the first packet body of each ingress port is written to a certain shared buffer unit 32, the logic records the buffer ID of the first packet body written to each port. When the packet header is written to shared buffer unit 32, the logic controls not to write to the shared buffer unit 32 corresponding to the first packet body of that packet.
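The write-side rule above (record the cache ID of each port's first packet body, then keep the packet header out of that unit) can be sketched as follows. The round-robin search order and the names `WriteBankSelector`, `write_first_body`, and `write_header` are assumptions made for illustration; the application does not say how a candidate unit is chosen.

```python
from typing import Dict, Iterable, Optional, Set


class WriteBankSelector:
    def __init__(self, num_banks: int) -> None:
        self.num_banks = num_banks
        self.first_body_bank: Dict[int, int] = {}  # ingress port -> cache ID of first body
        self._next = 0                             # round-robin start (assumed policy)

    def _pick(self, blocked: Set[int]) -> Optional[int]:
        """Return the next unit that is not blocked, or None if every unit is blocked."""
        for offset in range(self.num_banks):
            bank = (self._next + offset) % self.num_banks
            if bank not in blocked:
                self._next = (bank + 1) % self.num_banks
                return bank
        return None

    def write_first_body(self, port: int, busy_reading: Iterable[int] = ()) -> Optional[int]:
        """Choose a unit for the first packet body, avoiding units that are being read."""
        bank = self._pick(set(busy_reading))
        if bank is not None:
            self.first_body_bank[port] = bank      # remembered for the later header write
        return bank

    def write_header(self, port: int, busy_reading: Iterable[int] = ()) -> Optional[int]:
        """Choose a unit for the header, avoiding reads and the first body's unit."""
        blocked = set(busy_reading)
        blocked.add(self.first_body_bank[port])
        return self._pick(blocked)
```

A return value of `None` models a write that has to wait for a later cycle because every candidate unit is blocked.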
[0063] For the "avoidance logic," each shared cache unit 32 corresponds to a free pointer management module, which is used to allocate free cell pointers to each shared cache unit 32. Each shared cache unit 32 is also equipped with a small-depth (e.g., depth 4 or 8) free pointer prediction FIFO. This FIFO anticipates the free pointers of the corresponding shared cache unit 32 in advance for writing the packet header and body to the address corresponding to the shared cache unit 32. When the shared cache unit 32 is blocked due to read priority or packet header avoidance, the free pointers of this FIFO are also blocked and not used during writing.
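A free pointer prediction FIFO of this kind might look like the sketch below. It assumes one free list per shared cache unit and a prefetch depth of 4 (one of the example depths mentioned above); the class name `BankFreePointers` and its methods are hypothetical.

```python
from collections import deque
from typing import Deque, Optional


class BankFreePointers:
    """Free-cell management for one shared cache unit, with a small prefetch FIFO."""

    def __init__(self, num_cells: int, prefetch_depth: int = 4) -> None:
        self.free_list: Deque[int] = deque(range(num_cells))
        self.prefetch: Deque[int] = deque()
        self.prefetch_depth = prefetch_depth
        self._refill()

    def _refill(self) -> None:
        # Pull free pointers ahead of time so a write never waits on the free list.
        while len(self.prefetch) < self.prefetch_depth and self.free_list:
            self.prefetch.append(self.free_list.popleft())

    def allocate(self, blocked: bool) -> Optional[int]:
        """Hand out a cell pointer unless the unit is blocked this cycle."""
        if blocked or not self.prefetch:
            return None                 # prefetched pointers stay in place
        ptr = self.prefetch.popleft()
        self._refill()
        return ptr

    def release(self, ptr: int) -> None:
        """Return a cell pointer after the cell has been read out."""
        self.free_list.append(ptr)
        self._refill()
```

When `blocked` is true (read priority or header avoidance), the prefetched pointers are left untouched and remain available for a later cycle.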
[0064] On the read side, when the egress scheduling module 4 initiates the first scheduling of a packet, it simultaneously reads the packet header and the first packet body from two different shared cache units 32. Subsequent packet bodies are read by locating the shared cache unit 32 through the cell pointer. After the previous packet has been read, the next packet is read in the same way.
[0065] Since both writing to and reading from the shared cache units 32 handle the packet header and the first packet body simultaneously, the troughs shown in Figure 11 are all "cut off", improving performance without increasing the system clock frequency or the data bus width.
[0066] A second aspect of this application provides a switching chip architecture for improving message processing performance, applied to a switching chip in which the difference between the input / output bandwidth and the core processing bandwidth is not within a preset difference range. The architecture comprises an ingress scheduling module 1, a message parsing and processing module 2, a shared cache module 3, an egress scheduling module 4, and a merging module 5; the egress scheduling module 4 includes a packet header scheduler 41 and a packet body scheduler 42.
[0067] The ingress scheduling module 1 is configured to: receive a message to be processed and segment it into a message header and a packet body; send the message header to the message parsing and processing module 2, which parses and processes it to obtain the packet header; and write the packet header and the packet body into the shared cache module 3.
[0068] The message parsing and processing module 2 is also configured to parse and process the message header to obtain a message descriptor; the message descriptor includes the message length, destination port, and priority of the message to be processed.
[0069] The shared cache module 3 includes several input ports 31 and shared cache units 32.
[0070] The input port 31 is configured as follows: Read the packet body and configure a cache ID; record the cache ID of the first packet body, and configure the cache ID of the packet header based on the cache ID of the first packet body, so that the first packet body and the packet header are cached in different shared cache units 32; after the packet body and the packet header are read, write the packet header and the packet body into the corresponding shared cache unit 32 based on the cache ID using a multiplexer.
[0071] The packet header scheduler 41 is configured to: send the packet header pointers for the current round and the next round to the shared cache module 3, and read the packet headers for the current round and the next round.
[0072] The packet body scheduler 42 is configured to: send the first packet body pointer of the message to be processed to the shared cache module 3; read the first packet body of the message to be processed, and write the packet header and the first packet body of the current round into the merging module 5; and, after the packet headers and first packet bodies of the messages to be processed in the current round and the next round have been read, read the remaining packet bodies in sequence using the packet body pointers.
[0073] The merging module 5 is configured to: merge the packet header and the first packet body of the current round, and then serially output the merged result to the target port, followed by the remaining packet bodies.
[0074] The architecture also includes a cache management module 6, which is configured to receive the message descriptor and send it to the egress scheduling module 4; the egress scheduling module 4 is configured to determine the sending order and sending time of the messages to be processed based on the message descriptor.
[0075] For example, when the IO bandwidth is greater than the chip's core bandwidth, the aggregate input bandwidth of all ports is much greater than the processing bandwidth inside the switching chip. When packets are short, only some ports can be guaranteed to operate at line rate, while the others operate on a best-effort basis. In this situation, there is usually no internal scheduling bottleneck when the packet length is greater than the data bus width (for example, as shown in Figure 13, the packet rate is 392 Mpps when the packet length is 133 bytes, which is below the system clock frequency). Smaller packets, such as 64 to 96 bytes, do present a scheduling bottleneck, as shown in Figure 14: in this case, of the two data buses, only the packet header data bus is used to write to the shared cache unit 32. For example, if the IO bandwidth is 480 Gbps, the packet rate for 64-byte messages is 714.286 Mpps, which far exceeds the system clock processing capacity of 500M, so some ports can only be scheduled on a best-effort basis.
[0076] Because some ports are scheduled on a best-effort basis, the header of the following packet is scheduled, where possible, while the packet body of the preceding packet is being scheduled. To support this, the shared cache units 32 are divided into first shared cache units, second shared cache units, and third shared cache units, as detailed below. Figure 3 is a structural schematic of the shared cache module in another embodiment of this application.
[0077] In this embodiment, a first preset number of first shared cache units in the shared cache module 3 are fixedly used to cache packet headers; a second preset number of second shared cache units in the shared cache module 3 are fixedly used to cache packet bodies; and a third preset number of third shared cache units in the shared cache module 3 are used to cache either packet bodies or packet headers, but not both at once: while a third shared cache unit stores packet bodies it cannot store packet headers, and while it stores packet headers it cannot store packet bodies.
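The packet rates quoted above can be reproduced with a short calculation. The sketch assumes 20 bytes of per-frame Ethernet overhead on the wire (8-byte preamble plus 12-byte inter-frame gap); the application does not state this overhead, but it is the value that matches both quoted figures. The function name `packet_rate_mpps` is hypothetical.

```python
ETHERNET_OVERHEAD_BYTES = 20  # assumed: 8-byte preamble/SFD + 12-byte inter-frame gap
CLOCK_CAPACITY_M = 500        # the "500M" system clock capacity quoted in the text


def packet_rate_mpps(io_bandwidth_gbps: float, packet_bytes: int) -> float:
    """Packets per second (in millions) that arrive at a given IO bandwidth."""
    bits_on_wire = (packet_bytes + ETHERNET_OVERHEAD_BYTES) * 8
    return io_bandwidth_gbps * 1e9 / bits_on_wire / 1e6


for size in (64, 133):
    rate = packet_rate_mpps(480, size)
    verdict = "exceeds" if rate > CLOCK_CAPACITY_M else "is within"
    print(f"{size}-byte packets: {rate:.3f} Mpps, {verdict} {CLOCK_CAPACITY_M}M")
```

With these assumptions, 64-byte packets arrive faster than one per clock cycle, while 133-byte packets do not, which is why only the short packets hit the scheduling bottleneck.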
[0078] The shared cache unit 32 is configured with a packet header pointer and a packet body pointer; the packet header pointer and the packet body pointer are used to locate the cache addresses of the packet header and the packet body.
[0079] The shared cache unit 32 is also configured with a pointer FIFO; the pointer FIFO is used to record the corresponding packet body pointers according to the writing order of the remaining packet bodies when the remaining packet bodies are written into the shared cache unit 32.
[0080] The egress scheduling module 4 is further configured to read the packet bodies sequentially using the pointer FIFO.
[0081] The packet header pointer and the first packet body pointer of the message to be processed are not recorded in the pointer FIFO.
[0082] The shared cache unit 32 is configured as follows: The first packet body and the packet header are cached according to the cache ID of the first packet body of the message to be processed and the cache ID of the packet header. The corresponding packet header pointer and first packet body pointer of the message to be processed are sent to the egress scheduling module 4. After the packet header and the first packet body of the message to be processed are cached, the remaining packet bodies are cached, and the corresponding packet body pointers are recorded in the pointer FIFO according to the writing order of the remaining packet bodies.
[0083] Specifically, considering area and power consumption, all shared cache units 32 are implemented using SPRAM. Read operations have the highest priority in the shared cache units 32, so write operations to the shared cache units 32 must use "avoidance logic" to steer clear of read operations.
[0084] Based on the above analysis, packet body scheduling has no bottleneck, while packet header scheduling does. Packet header scheduling uses best-effort scheduling, while packet body scheduling uses ordinary calendar scheduling. The packet header and the packet body are therefore served by two independent schedulers, one each. The packet header scheduler can configure some ports as line-rate scheduling ports and others as best-effort scheduling ports. Best-effort ports are scheduled whenever the internal scheduling / bus is idle.
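One way to picture the packet header scheduler is a calendar of line-rate ports whose idle slots are handed to best-effort ports. The sketch below is an assumption-laden illustration, not the scheduler of the application: the calendar layout, the round-robin order among best-effort ports, and the names `HeaderScheduler` and `grant` are all hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List, Optional


class HeaderScheduler:
    def __init__(self, calendar: List[int], best_effort_ports: List[int]) -> None:
        self.calendar = calendar                      # line-rate ports, one per slot
        self.best_effort: Deque[int] = deque(best_effort_ports)
        self.slot = 0

    def grant(self, pending: Dict[int, int]) -> Optional[int]:
        """Grant one header read per cycle; `pending` maps port -> queued headers."""
        port = self.calendar[self.slot]
        self.slot = (self.slot + 1) % len(self.calendar)
        if pending.get(port, 0) > 0:                  # line-rate port owns its slot
            pending[port] -= 1
            return port
        # The slot is idle, so offer it to the best-effort ports in round-robin order.
        for _ in range(len(self.best_effort)):
            candidate = self.best_effort[0]
            self.best_effort.rotate(-1)
            if pending.get(candidate, 0) > 0:
                pending[candidate] -= 1
                return candidate
        return None
```

For example, with `calendar=[0, 1]`, best-effort ports `[2, 3]`, and pending counts `{0: 1, 1: 0, 2: 2, 3: 1}`, port 0 is served in its own slot and every other slot is reused by ports 2 and 3 until their queues drain.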
[0085] For the write side, in terms of timing, the packet body is written to the shared buffer unit 32 first, and the packet header is written to the shared buffer unit 32 after "packet parsing and processing". Since the read operation side needs to read both the packet header and the first packet body simultaneously, the write side needs to write the packet header and the first packet body to different shared buffer units 32 to avoid conflicts on the read side. After the first packet body of each ingress port is written to a certain shared buffer unit 32, the logic records the buffer ID of the first packet body written to each port. When the packet header is written to the shared buffer unit 32, the logic controls not to write to the shared buffer unit 32 corresponding to the first packet body of that packet.
[0086] For the "avoidance logic," each shared cache unit 32 corresponds to a free pointer management module, which is used to allocate free cell pointers to each shared cache unit 32. Each shared cache unit 32 is also equipped with a small-depth (e.g., depth 4 or 8) free pointer prediction FIFO. This FIFO anticipates the free pointers of the corresponding shared cache unit 32 in advance for writing the packet header and body to the address corresponding to the shared cache unit 32. When the shared cache unit 32 is blocked due to read priority or packet header avoidance, the free pointers of this FIFO are also blocked and not used during writing.
[0087] For the read side, when the egress scheduling module 4 initiates the first scheduling of the message, it reads the packet header and the first packet body from two different shared cache units 32 at the same time. Subsequent packet bodies are read by locating the shared cache unit 32 through the cell pointer.
[0088] Because best-effort ports exist, and in order to "do its best", the packet header scheduler can schedule the header of the next packet (especially a small packet shorter than the bus width) while the packet body of the preceding long packet is being scheduled. This raises the internal speedup ratio, which is how the best-effort ports achieve their best effort.
[0089] If, as in the first scheme, packet headers and packet bodies shared all shared cache units 32, a problem would arise when a packet header and a packet body are scheduled for output simultaneously. On the write side, a packet header is only guaranteed to be in a different shared cache unit 32 from the first packet body of its own message, so the shared cache unit 32 holding the header of a subsequent message may be the same one that holds a packet body of the preceding long packet, causing a read conflict on that shared cache unit 32. Therefore, in the shared cache, N banks are fixedly allocated as HeadBanks (first shared cache units) for storing packet headers, M banks are fixedly allocated as BodyBanks (second shared cache units) for storing packet bodies, and Y banks are dynamically allocated between Head and Body use (third shared cache units). Although the third shared cache units are dynamically allocated, once written they must serve as a fixed Head or Body bank, and can only be reallocated for Head or Body use after the entire cache is released.
[0090] With the above solution, the performance trough can be eliminated and the performance problem of small packets such as 64-byte packets can be addressed. By relying on best-effort scheduling of the HeadBanks, line rate for small packets can be guaranteed on some ports, as shown in Figure 12.
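The N / M / Y bank partition can be sketched as a small role table. This is an illustration under assumptions: "released" is interpreted here as every cell of a dynamic bank having been freed, and the names `Role`, `BankPool`, `can_write`, `write`, and `release` are hypothetical.

```python
from enum import Enum
from typing import List


class Role(Enum):
    HEAD = "head"
    BODY = "body"
    FREE = "free"   # dynamic bank that is not currently claimed


class BankPool:
    def __init__(self, n_head: int, m_body: int, y_dynamic: int) -> None:
        self.roles: List[Role] = (
            [Role.HEAD] * n_head + [Role.BODY] * m_body + [Role.FREE] * y_dynamic
        )
        self.dynamic_start = n_head + m_body
        self.used_cells = [0] * len(self.roles)

    def can_write(self, bank: int, kind: Role) -> bool:
        """A bank accepts a write if it already has that role or is an unclaimed dynamic bank."""
        return self.roles[bank] is kind or self.roles[bank] is Role.FREE

    def write(self, bank: int, kind: Role) -> None:
        if not self.can_write(bank, kind):
            raise ValueError(f"bank {bank} is reserved for {self.roles[bank].value}")
        self.roles[bank] = kind            # a FREE dynamic bank becomes fixed once written
        self.used_cells[bank] += 1

    def release(self, bank: int) -> None:
        if self.used_cells[bank] == 0:
            raise ValueError(f"bank {bank} has no cells in use")
        self.used_cells[bank] -= 1
        # Only a fully drained dynamic bank may be reallocated (assumed interpretation).
        if self.used_cells[bank] == 0 and bank >= self.dynamic_start:
            self.roles[bank] = Role.FREE
```

With `BankPool(2, 2, 1)`, banks 0 and 1 only ever hold headers, banks 2 and 3 only ever hold bodies, and bank 4 takes whichever role its first write assigns until it is emptied.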
[0091] It is worth noting that the effects of the components in the switching chip architecture that improve message processing performance in the second aspect mentioned above can be found in the effects of the components in the switching chip architecture that improve message processing performance in the first aspect mentioned above, and will not be repeated here.
[0092] This application provides a switching chip architecture to improve packet processing performance. In conventional switching / routing designs, the only way to address the performance bottleneck of inflection-point packets is to increase the system clock frequency or the data processing bit width. However, both methods pose significant challenges to process technology, timing design, and back-end place and route (PR), and may even be impossible to implement. This application solves the performance bottleneck of inflection-point packets without increasing the system clock frequency or data processing bit width by optimizing the switching architecture within the switching chip.
[0093] The above detailed embodiments further illustrate the purpose, technical solution, and beneficial effects of the embodiments of this application. It should be understood that the above are merely specific embodiments of the embodiments of this application and are not intended to limit the protection scope of the embodiments of this application. Any modifications, equivalent substitutions, improvements, etc., made on the basis of the technical solutions of the embodiments of this application should be included within the protection scope of the embodiments of this application.
Claims
1. A switching chip architecture for improving message processing performance, applied to a switching chip, wherein the difference between the input / output bandwidth and the core processing bandwidth of the switching chip is within a preset difference range, characterized in that it comprises: an ingress scheduling module (1), a message parsing and processing module (2), a shared cache module (3), an egress scheduling module (4), and a merging module (5); The ingress scheduling module (1) is configured as follows: Receive the message to be processed and segment the message into a message header and a packet body; send the message header to the message parsing and processing module (2), and obtain the packet header after the message header is parsed and processed by the message parsing and processing module (2); write the packet header and the packet body into the shared cache module (3); The egress scheduling module (4) is configured as follows: Send the packet header pointer and the first packet body pointer of the message to be processed to the shared cache module (3), read the packet header and the first packet body, and write the packet header and the first packet body into the merging module (5); after the packet header and the first packet body have been read, use the packet body pointer to read the remaining packet bodies in sequence; The merging module (5) is configured as follows: The packet header and the first packet body are merged and then serially output to the target port along with the remaining packet bodies.
2. The switching chip architecture for improving message processing performance according to claim 1, characterized in that, The shared cache module (3) includes: Several input ports (31) and shared cache units (32); The input port (31) is configured as follows: Read the packet body and configure the cache ID; Record the cache ID of the first packet body, and configure the cache ID of the packet header based on the cache ID of the first packet body, so that the first packet body and the packet header are cached in different shared cache units (32); After the packet body and the packet header have been read, based on the cache ID, the packet header and the packet body are written into the corresponding shared cache unit (32) using a multiplexer.
3. The switching chip architecture for improving message processing performance according to claim 2, characterized in that, The shared cache unit (32) is configured with a packet header pointer and a packet body pointer; the packet header pointer and the packet body pointer are used to locate the cache addresses of the packet header and the packet body; The shared cache unit (32) is also configured with a pointer FIFO; the pointer FIFO is used to record the corresponding packet body pointer according to the writing order of the remaining packet bodies when the remaining packet bodies are written into the shared cache unit (32); The egress scheduling module (4) is further configured as follows: The packet bodies are read sequentially using the pointer FIFO.
4. The switching chip architecture for improving message processing performance according to claim 3, characterized in that, The packet header pointer and the first packet body pointer of the message to be processed are not recorded in the pointer FIFO; The shared cache unit (32) is configured as follows: The first packet body and the packet header are cached according to the cache ID of the first packet body of the message to be processed and the cache ID of the packet header, and the corresponding packet header pointer and the first packet body pointer of the message to be processed are sent to the egress scheduling module (4). After the packet header and the first packet body of the message to be processed are cached, the remaining packet bodies are cached, and the corresponding packet body pointers are recorded in the pointer FIFO according to the writing order of the remaining packet bodies.
5. The switching chip architecture for improving message processing performance according to claim 1, characterized in that, The message parsing and processing module (2) is also configured as follows: The message header is parsed and processed to obtain the message descriptor; The message descriptor includes: the message length, destination port, and priority of the message to be processed; The architecture also includes: Cache management module (6), wherein the cache management module (6) is configured as follows: The message descriptor is received and sent to the egress scheduling module (4); the egress scheduling module (4) is configured to determine the sending order and sending time of the message to be processed according to the message descriptor.
6. A switching chip architecture for improving message processing performance, applied to a switching chip, wherein the difference between the input / output bandwidth and the core processing bandwidth of the switching chip is not within a preset difference range, characterized in that it comprises: an ingress scheduling module (1), a message parsing and processing module (2), a shared cache module (3), an egress scheduling module (4), and a merging module (5); the egress scheduling module (4) includes: a packet header scheduler (41) and a packet body scheduler (42); The ingress scheduling module (1) is configured as follows: Receive the message to be processed and segment the message into a message header and a packet body; send the message header to the message parsing and processing module (2), and obtain the packet header after the message header is parsed and processed by the message parsing and processing module (2); write the packet header and the packet body into the shared cache module (3); The packet header scheduler (41) is configured as follows: Send the packet header pointers for the current round and the next round to the shared cache module (3), and read the packet headers for the current round and the next round; The packet body scheduler (42) is configured as follows: The first packet body pointer of the message to be processed is sent to the shared cache module (3), the first packet body of the message to be processed is read, and the packet header and the first packet body of the current round are written into the merging module (5); after the packet headers and the first packet bodies of the messages to be processed in the current round and the next round are read, the remaining packet bodies are read in sequence using the packet body pointer; The merging module (5) is configured as follows: The packet header and the first packet body of the current round are merged and then serially output to the target port along with the remaining packet bodies.
7. The switching chip architecture for improving message processing performance according to claim 6, characterized in that, The shared cache module (3) includes: Several input ports (31) and shared cache units (32); The input port (31) is configured as follows: Read the packet body and configure the cache ID; Record the cache ID of the first packet body, and configure the cache ID of the packet header based on the cache ID of the first packet body, so that the first packet body and the packet header are cached in different shared cache units (32); After the packet body and the packet header have been read, based on the cache ID, the packet header and the packet body are written into the corresponding shared cache unit (32) using a multiplexer.
8. The switching chip architecture for improving message processing performance according to claim 7, characterized in that, The first preset number of first shared cache units in the shared cache module (3) are fixedly used to cache the packet header; the second preset number of second shared cache units in the shared cache module (3) are fixedly used to cache the packet body; the third preset number of third shared cache units in the shared cache module (3) are used to cache either the packet body or the packet header, wherein when the third shared cache unit is used to store the packet body, the packet header cannot be stored; when the third shared cache unit is used to store the packet header, the packet body cannot be stored. The shared cache unit (32) is configured with a packet header pointer and a packet body pointer; the packet header pointer and the packet body pointer are used to locate the cache addresses of the packet header and the packet body; The shared cache unit (32) is also configured with a pointer FIFO; the pointer FIFO is used to record the corresponding packet body pointer according to the writing order of the remaining packet bodies when the remaining packet bodies are written into the shared cache unit (32); The egress scheduling module (4) is further configured as follows: The packet bodies are read sequentially using the pointer FIFO.
9. The switching chip architecture for improving message processing performance according to claim 8, characterized in that, The packet header pointer and the first packet body pointer of the message to be processed are not recorded in the pointer FIFO; The shared cache unit (32) is configured as follows: The first packet body and the packet header are cached according to the cache ID of the first packet body of the message to be processed and the cache ID of the packet header, and the corresponding packet header pointer and the first packet body pointer of the message to be processed are sent to the egress scheduling module (4). After the packet header and the first packet body of the message to be processed are cached, the remaining packet bodies are cached, and the corresponding packet body pointers are recorded in the pointer FIFO according to the writing order of the remaining packet bodies.
10. The switching chip architecture for improving message processing performance according to claim 6, characterized in that, The message parsing and processing module (2) is also configured as follows: The message header is parsed and processed to obtain the message descriptor; The message descriptor includes: the message length, destination port, and priority of the message to be processed; The architecture also includes: Cache management module (6), wherein the cache management module (6) is configured as follows: The message descriptor is received and sent to the egress scheduling module (4); the egress scheduling module (4) is configured to determine the sending order and sending time of the message to be processed according to the message descriptor.