Network performance management engine

By deploying packet interceptors and Envoy filters in cloud computing networks, the problem of insufficient packet-level latency visibility in traditional systems is solved, enabling detailed latency data capture and automated fault detection, thereby improving the efficiency and reliability of network management.

CN122268751A · Pending · Publication Date: 2026-06-23 · EBAY INC

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications (China)
Current Assignee / Owner
EBAY INC
Filing Date
2025-12-18
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Traditional network management systems struggle to provide accurate visibility and real-time monitoring of packet-level latency in complex and dynamic cloud computing environments, making troubleshooting and performance optimization difficult, especially in the absence of fine-grained data.

Method used

Deploy packet interceptors (such as extended Berkeley Packet Filter (eBPF) programs) and custom Envoy filters to capture and record the packet processing duration at various points within the network, provide detailed latency data through path analysis graphs, and support automatic remediation and fault detection.

Benefits of technology

It improves the fine-grained visibility of network management systems, facilitates troubleshooting and performance optimization, and ensures reliable and efficient network operation.

✦ Generated by Eureka AI based on patent content.

Abstract

Methods, systems, and computer storage media are described that provide a network management engine in a cloud computing system. The network management engine is an end-to-end system responsible for overseeing, monitoring, and optimizing the entire lifecycle of network traffic from initial client request to final data response. By capturing, analyzing, and managing network packets, the network management engine can ensure seamless data flow and track key performance metrics to identify and resolve latency or congestion issues. The network management engine supports monitoring and fault detection for efficient data flow, capturing metrics, and analyzing performance. The network management engine includes a network packet management extension engine and a network performance management engine. The network packet management extension engine is a specialized engine designed to compute and analyze packet latency at various stages of a network path. The network performance management engine is a specialized engine that identifies and manages deviations from expected network behavior, specifically in areas where latency can occur.

Description

Technical Field

[0001] The embodiments of the present invention relate to the field of computing, and more specifically to artificial intelligence systems.

Background Technology

[0002] Users can interact with cloud mesh networks across various applications and services to complete network tasks. A cloud mesh network is a distributed, interconnected system of multiple cloud environments working together to provide scalable, flexible, and resilient network services across diverse geographical locations. Cloud mesh networks enable seamless communication and data sharing between various cloud platforms, allowing users to optimize resource allocation, improve data transfer speeds, and enhance fault tolerance. By leveraging a mesh architecture, cloud resources can be dynamically interconnected, ensuring high availability and redundancy. This type of network can support applications and services across various domains, such as load balancing, disaster recovery, and global content distribution. Through advanced routing protocols and automated resource management, cloud mesh networks enhance the overall performance and reliability of cloud systems, making them an ideal solution for enterprises and service providers with complex, distributed infrastructure needs.

Summary of the Invention

[0003] The various aspects of the technology described herein generally relate to systems, methods, and computer storage media for providing network management engines, among others, in cloud computing systems. A network management engine is an end-to-end system responsible for overseeing, monitoring, and optimizing the entire lifecycle of network services, from initial client requests to final data responses. By capturing, analyzing, and managing network packets, a network management engine can ensure seamless data flow and track key performance metrics to identify and resolve latency or congestion issues. Network management engines utilize a layered approach to network management, monitoring, and fault detection to ensure efficient data flow, metric capture, and performance analysis.

[0004] The network management engine comprises a network packet management extension engine and a network performance management engine. The network packet management extension engine is a specialized engine designed to calculate and analyze packet latency at various stages of the network path. It enhances network packet processing by capturing detailed information at key points in the data flow. For example, during client request processing and ingress at the Transport Layer Balancer (TLB) and data inspection and logging at the Ingress Gateway (GW), the network packet management extension engine can measure packet transmission time, capture metadata, and support connection tracking by logging latency within the network. This engine supports customized processing and data logging, providing deep insights into the latency dynamics of the entire network infrastructure.

[0005] A network performance management engine is a specialized engine that identifies and manages deviations from expected network behavior—specifically, expected network behavior in areas where latency may occur. By monitoring network traffic, analyzing latency patterns, and visualizing data, a network performance management engine detects performance issues and supports automated remediation. Key features include: latency calculation and graph generation at the client level, which visualizes latency data through path analysis graphs; and automated analysis and remediation, which leverages tools to correlate latency with system resources and reroutes traffic when necessary. The network performance management engine provides a holistic view of network performance and helps to proactively resolve issues through tools such as fault analyzers.

[0006] The client can proactively initiate a request to the application gateway to receive a response that includes network latency data captured via the response header. The application gateway (App GW) receives and validates the incoming request, then forwards it to the transport layer balancer (TLB) based on established routing rules. The TLB distributes the client request across available backend resources, thus balancing the load. Packet interceptors (e.g., extended Berkeley Packet Filter (eBPF) programs) are used for timestamp addition, capturing ingress and egress times to monitor packet flow. The first packet interceptor adds an ingress timestamp to the packet when it enters the TLB, while the second packet interceptor measures the egress time of the packet leaving the TLB, so that the total processing duration within the TLB can be calculated.
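
The timestamp arithmetic described above can be sketched as follows. This is a minimal illustration only: real packet interceptors of this kind would typically be eBPF programs running in the kernel, whereas this Python model only shows how an ingress and an egress timestamp yield a processing duration. The names `PacketTimestamps` and `processing_duration_ns`, the nanosecond unit, and the sample values are assumptions, not taken from the patent text.

```python
from dataclasses import dataclass


@dataclass
class PacketTimestamps:
    """Timestamps (assumed to be nanoseconds) recorded for one packet at one node."""
    packet_id: str
    ingress_ns: int  # stamped by the first interceptor when the packet enters the TLB
    egress_ns: int   # measured by the second interceptor when the packet leaves the TLB


def processing_duration_ns(ts: PacketTimestamps) -> int:
    """Return the time the packet spent inside the node."""
    if ts.egress_ns < ts.ingress_ns:
        # Clock skew or a mismatched packet; refuse to report a negative duration.
        raise ValueError("egress timestamp precedes ingress timestamp")
    return ts.egress_ns - ts.ingress_ns


# Hypothetical sample values for a single packet.
sample = PacketTimestamps("pkt-1", ingress_ns=1_000_000, egress_ns=1_350_000)
tlb_duration_us = processing_duration_ns(sample) / 1_000  # microseconds
```

Both timestamps are taken on the same node in this sketch, so the subtraction does not depend on clocks being synchronized across nodes.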

[0007] Packets are encapsulated via tunneling to enable secure transmission between the TLB and the ingress gateway (GW). The ingress GW routes traffic from the TLB to the appropriate backend services and works with a third packet interceptor to record packet traversal times, thereby capturing relevant metadata for latency tracking. The third packet interceptor records the ingress timestamp of packets arriving at the ingress GW and the egress timestamp of packets leaving it, enabling detailed latency measurements.

[0008] The Application Gateway Envoy (App GW Envoy) operates to pass packets to backend servers and add trace headers with latency and node metadata for network visibility and end-to-end latency tracing, before delivering annotated packets. Initially, the server processes client requests received via the Envoy server by performing backend operations, routing responses back to the Envoy server for header tracing. The server application processes client requests based on business logic, generates responses, and sends them back to the Envoy server. Then, the App GW Envoy adds a final trace header to the outgoing packets, completing the latency monitoring chain, and delivers the packets to the client.
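
One way to picture the trace-header chain described above is the sketch below, in which each node appends its own latency entry to a single header and the receiver parses the entries back out. The header name `x-latency-trace`, the `node=...;dur_us=...` entry format, and the node names and durations are all hypothetical; the patent text does not specify a header name or wire format, and an actual Envoy filter would be written against Envoy's own filter API rather than in Python.

```python
TRACE_HEADER = "x-latency-trace"  # hypothetical header name, not from the source


def append_hop(headers: dict, node: str, duration_us: int) -> dict:
    """Return a copy of the headers with one more hop entry appended."""
    entry = f"node={node};dur_us={duration_us}"
    existing = headers.get(TRACE_HEADER)
    updated = dict(headers)
    updated[TRACE_HEADER] = f"{existing},{entry}" if existing else entry
    return updated


def parse_hops(headers: dict) -> list:
    """Parse the trace header into a list of (node, duration_us) tuples."""
    raw = headers.get(TRACE_HEADER, "")
    hops = []
    for entry in filter(None, raw.split(",")):
        fields = dict(part.split("=", 1) for part in entry.split(";"))
        hops.append((fields["node"], int(fields["dur_us"])))
    return hops


# Each node on the path adds its entry; the values here are made up.
headers = {}
for node, dur in [("tlb", 350), ("ingress-gw", 1200), ("server-envoy", 800)]:
    headers = append_hop(headers, node, dur)

hops = parse_hops(headers)
total_us = sum(duration for _, duration in hops)
```

Keeping the entries in arrival order means the parsed list doubles as the path the request took, which is the information a path analysis graph needs.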

[0009] The client extracts and computes network latency data. The client can use the path analysis graph to update logs (e.g., a log database) to support alert generation. Logs are stored in the path analysis graph generated at the client. A time-series database (TSDB) stores service metrics (e.g., latency and performance metrics over time) from the TLB and ingress GW to support analysis (e.g., via connections to path analysis data) and enable efficient data retrieval for fault analysis and trend visualization. The fault analyzer analyzes log and time-series database data to identify network anomalies, generating fault alerts when network performance or latency deviations are detected. These operations, centered on capturing, monitoring, and analyzing packet flows, are performed via the network management engine to provide end-to-end network performance management and fault detection.
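
The fault analyzer's role of flagging latency deviations can be illustrated with a simple baseline comparison. The detection rule below (mean plus `k` standard deviations over an initial baseline window) is an assumption chosen for illustration; the patent text does not state which rule the fault analyzer applies. The function name, parameters, and sample series are likewise hypothetical.

```python
from statistics import mean, pstdev


def detect_latency_faults(samples_ms, baseline_size=5, k=3.0):
    """Return indices of samples that exceed the baseline mean by more than k standard deviations.

    The first `baseline_size` samples define the expected behavior; only later
    samples are checked against it.
    """
    baseline = samples_ms[:baseline_size]
    mu = mean(baseline)
    sigma = pstdev(baseline)
    # Guard against a perfectly flat baseline, where sigma would be zero.
    threshold = mu + k * max(sigma, 1e-9)
    return [
        i for i, value in enumerate(samples_ms)
        if i >= baseline_size and value > threshold
    ]


# Hypothetical TLB latency series, as it might be read from a time-series database.
tlb_latency_ms = [10, 11, 9, 10, 10, 11, 48, 10]
alerts = detect_latency_faults(tlb_latency_ms)
```

In this sample only the 48 ms reading is flagged; the 11 ms reading stays within the band implied by the baseline.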

[0010] This summary is provided to introduce, in simplified form, a selection of concepts that are further described below in the detailed description. This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used to help determine the scope of the claimed subject matter.

Brief Description of the Drawings

[0011] The technology described herein will now be described in detail with reference to the accompanying drawings, in which:

[0012] Figure 1 is a block diagram of a cloud mesh network architecture, in accordance with aspects of the technology described herein;

[0013] Figures 2A and 2B are block diagrams of a network management system for providing network management, in accordance with aspects of the technology described herein;

[0014] Figures 2C and 2D are block diagrams of a network management system for providing network management, in accordance with aspects of the technology described herein;

[0015] Figure 3 is a block diagram of a network management system for providing network management, in accordance with aspects of the technology described herein;

[0016] Figures 4A, 4B, and 4C provide a first set of exemplary methods for providing network management in a network management system via a network packet management extension engine, in accordance with aspects of the technology described herein;

[0017] Figures 5A, 5B, and 5C provide a second set of exemplary methods for providing network management in a network management system via a network performance management engine, in accordance with aspects of the technology described herein;

[0018] Figure 6 provides a block diagram of a computing environment suitable for implementing aspects of the technology described herein for an artificial intelligence system;

[0019] Figure 7 provides a block diagram suitable for implementing aspects of the technology described herein; and

[0020] Figure 8 is a block diagram of an exemplary computing environment suitable for implementing aspects of the technology described herein.

Detailed Description

[0021] Overview

[0022] Cloud computing systems provide a distributed network of remote servers hosted over the internet for storing, managing, and processing data, rather than relying on local servers or personal computers. In cloud computing systems, resources such as storage, processing power, and applications are provided as services to users via the internet. These services can be scaled as needed and are typically designed to support multi-tenant architectures, where multiple clients can securely share the same infrastructure. Cloud computing systems act as the foundation for providing seamless connectivity and distributed data processing, allowing client requests to be managed and processed over a network of interconnected resources.

[0023] Cloud mesh networks provide a networking framework for seamless interconnection and communication between multiple cloud computing resources, typically distributed across different physical and virtual environments. In a cloud mesh network, various nodes and gateways (e.g., application gateways, transport layer balancers, ingress gateways, and other network components) work together to create a unified network architecture that routes and balances traffic between distributed services. A cloud mesh network can be a mosaic mesh network (i.e., a tess cloud mesh) that might be used in geographically distributed sensor networks, where data points (or “nodes”) form repeating spatial patterns within a specific area. By establishing secure and efficient routing paths, cloud mesh networks allow data to flow between nodes with minimal latency, ensuring optimal performance and reliability.

[0024] In this way, the cloud computing system provides the underlying infrastructure to support distributed applications and storage, while the cloud mesh network manages the real-time data flow between these distributed resources. Through this network, data can flow through various processing layers (from client requests, through load balancers and application gateways to backend services), with precise control over packet processing, timestamp addition, and latency tracking. Together, the cloud computing system and the cloud mesh network ensure that each request is processed efficiently, end-to-end latency is minimized, and anomalies in network behavior can be quickly detected and resolved through real-time monitoring and fault analysis.

[0025] Traditionally, network management systems have had limitations in providing detailed, real-time visibility into packet-level latency and traffic behavior, especially in complex and dynamic environments. This limitation makes it difficult for systems to accurately detect subtle anomalies, correlate them with specific network paths, and identify root causes in large, evolving network infrastructures. For example, conventional network packet management systems typically have limitations in their ability to accurately calculate and quantify packet latency, particularly in complex and dynamic network topologies. In such networks, packets may be transmitted along different paths affected by varying degrees of congestion, routing changes, or Quality of Service (QoS) configurations—all of which can lead to significant delays. For instance, if a packet experiences congestion on one path, its latency may increase, while a packet using another path may experience minimal latency.

[0026] These factors cause network management tools to have constantly shifting objectives, making it difficult to provide precise real-time visibility into latency for each individual packet. While these tools can offer valuable insights into broad network performance trends (e.g., identifying general congestion points or large-scale latency issues), they typically lack the detailed packet-level visibility required for accurate troubleshooting. Therefore, achieving accurate, real-time monitoring of packet latency in complex, dynamic environments remains a significant challenge. For example, large enterprise networks may establish multiple redundant links between data centers. Network packet management systems may report overall latency trends but cannot determine whether a particular packet is taking a longer path due to a temporary routing change or congestion on one of the links. This lack of granularity makes it difficult for network administrators to understand the root cause of latency problems on a packet-by-packet basis.

[0027] Similarly, traditional network performance management systems have limitations in detecting anomalies and correlating them with specific node-level service metrics. Modern networks are inherently dynamic, with constantly changing service patterns, adaptive routing protocols, and fluctuating QoS settings, all adding layers of complexity. These dynamic factors make maintaining accurate visibility into network service behavior at the packet level challenging. For example, if a particular link begins to experience higher-than-expected latency, the system may need some time to detect the anomaly, especially when the engine relies on sampled data or aggregated metrics that may not capture the fine-grained details needed to identify subtle problems.

[0028] Furthermore, conventional performance management systems that only monitor aggregated data from routers or switches may miss critical nuances, such as intermittent latency spikes caused by overloaded network nodes or undetected network path failures. Detecting these anomalies becomes even more difficult with limited visibility into specific network segments or paths. Essentially, identifying and resolving performance issues in complex modern network infrastructures is a significant operational challenge without more advanced monitoring tools that provide in-depth packet-level insights and real-time relevance. For example, a sudden drop in application performance due to latency might be incorrectly attributed to a WAN bottleneck when, in reality, it could be due to a temporary misconfiguration in the routing table of a specific network node. Without precise, packet-by-packet visibility, such anomalies may go undetected, leading to delays in troubleshooting and resolution. Therefore, a more comprehensive network management system with an alternative foundation for performing network management operations can improve the computational operations and interfaces used to provide network management.

[0029] Description of the technical solution

[0030] At a high level, the network management engine in a cloud computing system oversees the entire lifecycle of network services (from client requests to data responses) by capturing, analyzing, and optimizing network packets to ensure seamless data flow. The network management engine comprises two main components: a network packet management extension engine, which calculates and analyzes packet latency at critical stages (e.g., during client request processing and data inspection); and a network performance management engine, which monitors services, detects latency deviations, visualizes data, and supports automated remediation to address performance issues and optimize network behavior.

[0031] Cloud computing networks can be based on a distributed cloud architecture (e.g., cloud mesh networking or Tess cloud mesh) that connects various services and resources in a mesh configuration, thereby enhancing flexibility and scalability. In the context of cloud computing networks, seamless communication and resource sharing between different components are enabled, allowing for dynamic load balancing across multiple cloud environments, improved fault tolerance, and optimized resource utilization.

[0032] For example, a distributed cloud architecture in a cloud computing network can consist of several Tess Application Gateways (App Gateways). Each App Gateway instance comprises an ingress gateway (GW) and a TLB (Transport Layer Load Balancer). The ingress gateway acts as an L7 load balancer, while the TLB acts as an L4 load balancer. L4 load balancers operate at the transport layer, routing traffic based on IP addresses and ports, making them highly efficient for high-throughput applications. In contrast, L7 load balancers operate at the application layer, enabling them to make more advanced routing decisions based on the content of the request (e.g., HTTP headers and URLs).
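
The difference between the two balancing layers can be sketched as below. The L4 function sees only addresses and ports and picks a backend by hashing them, while the L7 function inspects the request's host and URL path. The hashing scheme, route table, host names, and pool names are illustrative assumptions and do not describe how the TLB or ingress GW in the patent actually select a target.

```python
import hashlib


def l4_pick(src_ip, src_port, dst_ip, dst_port, backends):
    """Transport-layer choice: hash the connection tuple, ignore request content."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]


def l7_pick(host, path, routes, default):
    """Application-layer choice: match on HTTP host and URL path prefix."""
    for (route_host, prefix), backend in routes:
        if host == route_host and path.startswith(prefix):
            return backend
    return default


# Hypothetical ingress GW nodes behind a TLB, and hypothetical L7 routes.
ingress_gateways = ["gw-1", "gw-2", "gw-3"]
routes = [(("shop.example", "/api/"), "api-pool")]

chosen_gw = l4_pick("10.0.0.1", 40000, "192.0.2.10", 443, ingress_gateways)
chosen_pool = l7_pick("shop.example", "/api/v1/items", routes, default="web-pool")
```

Because the L4 choice depends only on the connection tuple, every packet of one connection lands on the same node without the balancer parsing any payload, which is why this layer suits high-throughput traffic.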

[0033] An App GW instance can be equated to a single hardware load balancer. Each App GW instance consists of M ingress GW instances and N TLB nodes. An App GW instance can host multiple VIPs. Requests for any VIP can enter through one of the TLB nodes and then be forwarded to one of the ingress GW nodes via an IP-in-IP tunnel before being routed to the server endpoint. The response is sent directly back to the client from the ingress GW node. Incoming TLS connections to the VIP terminate at one of the ingress GW nodes. A new, persistent connection is established from the ingress GW node to the service backend pod. This connection terminates at the Envoy proxy instance within the pod, which is also a mesh component.

[0034] Referring to Figure 1, a cloud computing system 100 with a cloud mesh network 100A architecture (e.g., a Tess App GW architecture) is illustrated. This cloud mesh network 100A architecture is designed to efficiently manage incoming requests while providing secure and reliable access to backend services. At the heart of this architecture is the concept of modular components that work together to facilitate seamless communication from client to service endpoint. At the forefront of this architecture is client 120, which represents a user or system initiating a request for a service. Client 120 operates through Border Gateway Protocol components (e.g., BGP 130) used for efficient routing of traffic between different networks. Client 120 is the source of a request intended to access a service hosted behind App GW 140. The client targets a specific Virtual IP (VIP) hosted by App GW 140, thus marking an entry point associated with cloud mesh network 110. When a request is made, it first encounters App GW 140, which acts as a hub for managing these incoming requests, assigning them to the appropriate services, and providing visibility and routing.

[0035] Each App Gateway instance consists of several Ingress Gateway (GW) 146 instances and Transport Layer Balancer (TLB) 142 nodes. Ingress Gateway (GW) 146 is the gateway node where incoming TLS connections are terminated and routed to backend service pods. TLB nodes operate to distribute incoming requests across available Ingress Gateways, ensuring load balancing and preventing any single node from becoming overloaded. Requests are sent to the VIP ingress through one of these TLB nodes, which then forward the request to the Ingress Gateway node via an IP tunnel (e.g., tunnel 144). This tunneling mechanism provides a secure and efficient way to transfer data between components.

[0036] When requests reach the ingress GW node, they encounter a TLS termination process, in which the incoming secure connection is decrypted. Here, a persistent connection is established with the backend service pod; specifically, the App GW Envoy 148 is connected to the server Envoy 152. Server Envoy 152 can be an Envoy proxy instance operating within the application pod, thereby managing communication between the ingress GW 146 and the backend services represented by server 150 and server application 154.

[0037] Server application 154 is the application where the core business logic resides. It processes requests and generates responses, then relays those responses back through established channels. The response follows the same path in reverse: from the server application through server Envoy 152 back to ingress GW 146; but from ingress GW 146 directly to client 120. While this architecture handles business logic and ensures secure communication, it faces significant challenges in terms of visibility. Currently, metrics about request durations are captured at ingress GW 146 and Envoy proxy 148, but detailed insights into per-hop latency and the specific transmission path of requests within the network remain difficult to obtain. This lack of visibility complicates the troubleshooting process, making it difficult for teams to identify hotspots or latency issues within the mesh.

[0038] Understanding how data packets traverse a cloud mesh network architecture 110 and the time they spend at each step can be challenging. The lack of visibility into per-hop latency and the transmission path of requests presents significant challenges to cloud mesh network architectures. This issue complicates diagnosing network performance problems, optimizing resource allocation, and ensuring reliable service delivery.

[0039] For example, when a request is made to a service VIP (virtual IP address), the request traverses multiple nodes within the network, including TLB nodes and the ingress GW node. However, without detailed insights into the time spent at each hop, it is difficult to identify where latency is occurring. This lack of granular data makes it difficult to pinpoint performance bottlenecks, making troubleshooting a complex and time-consuming task.

[0040] Furthermore, the inability to track the exact path of requests through the mesh means that hotspots (i.e., areas experiencing higher than normal traffic or latency) cannot be easily identified and addressed. This can lead to performance degradation and impact the overall reliability of the network. For a network to be reliable, requests must be delivered continuously and promptly, and unexpected delays can undermine this reliability.

[0041] As an example, a user accesses a web application hosted on the Tess Cloud Mesh. A request is sent from the user's browser to a service VIP managed by the Tess Cloud Mesh. This request traverses several components before reaching the server endpoint responsible for processing it. Initially, the request is routed to a TLB node, where load balancing is handled at Layer 4. From there, the request is forwarded via an IP tunnel to one of the ingress GW nodes. At the ingress GW node, which acts as a Layer 7 load balancer, the request is processed and sent to the appropriate server endpoint. A connection is established to the service's backend pod, and an Envoy proxy within that pod takes over processing the request. When it reaches the server endpoint, the request is processed and a response is generated, which is then sent back to the user's browser via the same path. In this scenario, there are several possible latency points: the time it takes for the request to travel from the TLB node to the ingress GW node; the processing time at the ingress GW node; the duration required for the request to reach the server endpoint and generate a response; and the time required for the response to travel through the ingress GW node and the TLB node back to the user's browser.

[0042] In the absence of visibility into the latency of each of these hops, identifying the source of the latency becomes challenging. For example, if a user experiences slow response times, possible causes could include: extended processing times at the TLB node; latency within the IP tunnel between the TLB and the ingress GW node; a bottleneck at the ingress GW node; or slow processing at the server endpoint.

[0043] The lack of visibility into per-hop latency and the transmission path of requests through the Tess cloud mesh poses significant challenges in diagnosing problems, optimizing performance, and ensuring reliability. The proposed solution aims to address these challenges by providing detailed insights into network performance, thereby improving overall network efficiency and reliability. Specifically, the proposed solution involves deploying packet interceptors (e.g., extended Berkeley Packet Filter (eBPF) programs) and custom Envoy filters to measure and record the duration of packet processing at various points within the network. By capturing detailed metrics about the path and latency of requests, the system aims to make the path from the client to the server endpoint visible. This visibility will enhance the ability to diagnose problems, optimize performance, and ensure reliable network operation.

[0044] By implementing the proposed solution, which includes deploying packet interceptors (e.g., extended Berkeley Packet Filter (eBPF) programs) and custom Envoy filters, detailed metrics about the path and latency of requests can be captured. For example, an eBPF program attached to a TLB node can measure the time spent on packet processing and forwarding. Similarly, an eBPF program attached at the ingress GW node can track the time spent on request processing and forwarding to the server endpoint.

[0045] With these metrics in place, if users report slow response times, network administrators can quickly identify whether the latency is occurring at the TLB node, the ingress GW node, or the service endpoint. This enhanced visibility facilitates more efficient troubleshooting and optimization, thereby ensuring reliable and efficient network operation.

[0046] For example, in complex networks, data packets can take various paths based on routing protocols, congestion, and configuration. Consider, for instance, a request from a New York user to a service hosted in a California data center. The packet may pass through several nodes (such as routers and gateways) before reaching its destination. Without a clear understanding of how long each hop takes, it's difficult to identify where latency is occurring. For example, if the first hop from the user's device to the local router takes 5ms, but the next hop to the regional data center takes 50ms due to congestion, without tracking this, the network team might assume the problem lies at the service endpoint, unaware that the latency is occurring earlier in the routing process.
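
The New York to California example above can be worked through numerically. The 5 ms and 50 ms figures come from the example; the 8 ms service endpoint time is an assumed value added only so that the breakdown has a third component to compare against.

```python
# Per-hop latency in milliseconds for one request.
hops_ms = {
    "device -> local router": 5,                 # from the example above
    "local router -> regional data center": 50,  # from the example above
    "service endpoint processing": 8,            # assumed value for illustration
}

total_ms = sum(hops_ms.values())
slowest_hop = max(hops_ms, key=hops_ms.get)
slowest_share = hops_ms[slowest_hop] / total_ms

print(f"total: {total_ms} ms")
print(f"slowest hop: {slowest_hop} ({slowest_share:.1%} of total)")
```

With these numbers the congested hop accounts for roughly four fifths of the 63 ms total, so tuning the service endpoint would recover at most 8 ms. Without the per-hop breakdown, only the 63 ms total would be visible.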

[0047] To optimize performance, it's essential to understand the exact path a request takes from start to finish. This means knowing which nodes the request passes through and how long it takes at each node. By using a system that tracks requests, network administrators can understand how a request travels from the client to the ingress gateway (GW), then to the load balancer, and finally to the TLB nodes before reaching the server endpoint. If the load balancer introduces a 40ms latency, this information allows the team to focus on optimizing that specific component.

[0048] Detailed metrics are crucial for diagnosing problems and improving network performance. These metrics should not only show average latency but also identify patterns and anomalies. If metrics reveal that latency at the TLB node consistently spikes to 100ms during peak periods, the network team can investigate further. They might discover that the node is overloaded and needs scaling, or that there are configuration issues impacting performance.
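
Separating a consistent peak-period spike from an isolated outlier can be sketched by grouping samples by hour and checking what fraction of each hour's samples reach the spike level. The 100 ms level comes from the paragraph above; the `min_fraction` rule, the function name, and the sample data are assumptions made for illustration.

```python
from collections import defaultdict


def hours_with_spikes(samples, spike_ms=100.0, min_fraction=0.5):
    """Return the hours in which at least `min_fraction` of samples reach `spike_ms`.

    `samples` is an iterable of (hour_of_day, latency_ms) pairs.
    """
    by_hour = defaultdict(list)
    for hour, latency_ms in samples:
        by_hour[hour].append(latency_ms)

    flagged = []
    for hour in sorted(by_hour):
        values = by_hour[hour]
        spikes = sum(1 for value in values if value >= spike_ms)
        if spikes / len(values) >= min_fraction:
            flagged.append(hour)
    return flagged


# Hypothetical TLB latency samples: (hour of day, latency in ms).
samples = [
    (9, 12.0), (9, 15.0),
    (12, 105.0), (12, 98.0), (12, 120.0),
    (18, 110.0), (18, 14.0), (18, 13.0),
]
peak_hours = hours_with_spikes(samples)
```

Here hour 12 is flagged because most of its samples reach 100 ms, while the single 110 ms reading at hour 18 is treated as an outlier. An average over the whole day would hide both.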

[0049] Example systems and resources

[0050] Aspects of the technical solution are described with reference to Figure 2A, Figure 2B, Figure 2C, and Figure 3. Figure 2A illustrates a cloud computing system 100 comprising: a cloud mesh network 110, a network packet management extension engine 110A, and a network performance management engine 110B; a client 120, a Border Gateway Protocol (BGP) 130, an Application Gateway (App GW) 140, a Transport Layer Balancer (TLB) 142, packet interceptors 142A and 142B, a tunnel 144, an ingress gateway (ingress GW) 146, a packet interceptor 146A, an application gateway Envoy (App GW Envoy) 148, a server 150, a server Envoy 152, a server application 154, a log 160, a time-series database (TSDB) 170, and a fault analyzer 180. The cloud computing system 100 corresponds to the cloud computing system 600 described with reference to Figure 6.

[0051] The Network Packet Management Extension Engine 110A and the Network Performance Management Engine 110B are collectively referred to as the Network Management Engine 110. The Network Management Engine 110 is an end-to-end system responsible for supervising, monitoring, and optimizing the entire lifecycle of network services, from initial client requests to final data responses. By capturing, analyzing, and managing network packets, the Network Management Engine ensures seamless data flow and tracks key performance metrics to identify and resolve latency or congestion issues.

[0052] The Network Packet Management Extension Engine 110A is a specialized engine designed to calculate and analyze packet latency at various stages of the network path. It enhances network packet processing by capturing detailed information at key points in the data flow. For example, during client request processing and ingress at the TLB, and during data inspection and logging at the ingress gateway (GW), engine 110A can measure packet transmission time, capture metadata, and support connection tracking by logging latency within the network. This engine supports customized processing and data logging, providing deep insights into the latency dynamics of the entire network infrastructure.

[0053] The Network Performance Management Engine 110B is a specialized engine that identifies and manages deviations from expected network behavior, specifically in areas where latency may occur. The Network Performance Management Engine 110B detects performance issues and provides AI-powered automated remediation by monitoring network traffic, analyzing latency patterns, and visualizing data. Key features include: latency calculation and graph generation at the client end, which visualizes latency data through path analysis graphs; and automated analysis and remediation, which leverages tools to correlate latency with system resources and reroutes traffic when necessary. The Network Performance Management Engine 110B provides a holistic view of network performance and helps to proactively resolve issues through tools such as the Fault Analyzer 180.

[0054] The Network Management Engine 110 utilizes a layered approach for network management, monitoring, and fault detection. Each component plays a role in ensuring efficient data flow, capturing metrics, and analyzing performance. Below is a detailed breakdown of each component, including its function, the data it processes, its interface, and key operations.

[0055] Client 120 initiates a network request to access a service or application. Client 120 may represent an end user or an automated system making the request to the backend. Client 120 transmits the outgoing network request, which includes a header, source IP / destination IP, and payload. Client 120 sends the request to a network entry point (e.g., App GW 140). Client 120 initiates a connection request and waits for a response, optionally capturing the response header and latency data for analysis.

[0056] BGP 130 (Border Gateway Protocol) determines the optimal path for packets throughout the network. In this architecture, it ensures packets reach backend services along efficient and reliable paths. BGP 130 can include routing tables associated with network path metrics. BGP 130 connects to routers and network gateways to share routing information and improve path selection. BGP 130 dynamically updates routing paths based on network conditions, minimizing latency and rerouting traffic as needed to avoid congestion.

[0057] The App GW 140 provides the entry point for client services. It filters, manages, and routes traffic to internal services, such as the Transport Layer Balancer (TLB) 142. The App GW 140 accesses and processes incoming requests from clients, including IP headers and payloads. It interfaces with clients, the TLB, and other internal services. The App GW 140 also validates incoming requests, applies security policies, and forwards traffic to the TLB for load balancing.

[0058] TLB 142 is responsible for distributing incoming traffic across multiple service instances. This ensures balanced workload distribution and minimizes latency. TLB 142 processes incoming packets with headers, timestamps, and metadata. TLB 142 connects to the ingress gateway (GW) 146 via tunnel 144. TLB 142 distributes requests to backend services based on load, availability, and other performance factors. TLB 142 hosts packet interceptors 142A and 142B to timestamp packets and monitor latency.

[0059] Packet interceptors 142A and 142B provide in-kernel monitoring and data processing at the TLB level, enabling high-performance packet inspection and latency tracking. The packet interceptors track network packet headers, timestamps, and processing metrics. They are attached to the TLB's ingress and egress points to monitor packet flow. Packet interceptor 142A adds a timestamp to incoming packets at the ingress point, and packet interceptor 142B records the exit time at the egress point. Together, they calculate the processing duration within the TLB.
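
The ingress and egress timestamping described above can be sketched in user-space Python. This is only an illustrative model, not the in-kernel implementation: the `PacketRecord` type, its field names, and the nanosecond unit are assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PacketRecord:
    """Hypothetical per-packet record; names and units are illustrative."""
    flow_id: str
    ingress_ns: Optional[int] = None  # set by the ingress interceptor (142A)
    egress_ns: Optional[int] = None   # set by the egress interceptor (142B)


def stamp_ingress(record: PacketRecord, now_ns: int) -> None:
    """Model of interceptor 142A: record the arrival time at the TLB."""
    record.ingress_ns = now_ns


def stamp_egress(record: PacketRecord, now_ns: int) -> None:
    """Model of interceptor 142B: record the exit time from the TLB."""
    record.egress_ns = now_ns


def tlb_duration_ns(record: PacketRecord) -> int:
    """Processing duration inside the TLB: exit time minus arrival time."""
    if record.ingress_ns is None or record.egress_ns is None:
        raise ValueError("both ingress and egress timestamps are required")
    return record.egress_ns - record.ingress_ns
```

For example, a packet stamped at 1,000 ns on ingress and 41,000 ns on egress yields a TLB duration of 40,000 ns.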

[0060] Tunnel 144 securely transmits packets between TLB 142 and ingress GW 146. This encapsulated path prevents interference and maintains data integrity. Tunnel 144 processes encrypted packet payloads with headers and connects the TLB to the ingress GW. Tunnel 144 provides a secure channel for network traffic, thereby maintaining isolation and data protection as packets pass through this infrastructure.

[0061] The ingress gateway (ingress GW) 146 manages incoming traffic from the TLB 142 and forwards it to the appropriate service instance. The ingress GW 146 processes network packets with encapsulated metadata and communicates with the App GW Envoy 148 and the TLB 142. The ingress GW 146 directs packets to specific application instances or proxies, and a packet interceptor 146A is attached to it to further monitor processing time and extract relevant data for latency analysis.

[0062] Packet interceptor 146A operates within ingress GW 146 to record timestamps as packets flow through and collect metadata. Packet interceptor 146A processes packet metadata (e.g., timestamps and source IPs) and monitors ingress and egress traffic within ingress GW 146. Packet interceptor 146A tracks the traversal time of packets from entering to leaving ingress GW 146, thereby calculating latency and capturing relevant metrics for analysis.

[0063] The Application Gateway Envoy (App GW Envoy) 148 is an application gateway-level Envoy agent that attaches metadata to packets to create trace headers for end-to-end latency monitoring. The App GW Envoy 148 processes packets with trace headers and latency metric annotations and communicates with the ingress GW to add data to response packets passed from backend servers. The App GW Envoy 148 inserts trace headers into each packet, providing hop-by-hop visibility into network latency and duration at each node.

[0064] Packet latency data, which is associated with packets, refers to the collected metrics and contextual details that quantify the time a packet takes to travel through a specific network element and (by extension) the entire network path. In addition to their kernel-level representation, these metrics can also be displayed at the application layer via HTTP headers inserted by intermediate proxies such as Envoy. In other words, the same latency data initially captured in low-level data structures is transformed into a human-readable form appended to the HTTP header of response packets, thus providing end-to-end visibility of per-hop latency and network performance bottlenecks to clients and downstream systems.

[0065] For example, the measured TLB duration, along with other time metrics recorded at different hops, can be injected into the HTTP header before sending the response packet back to the client. These headers (which may include identifiers such as X-CORP-MESH-TLB-DURATION or similar identifiers) allow the receiving end to gain a clear hop-by-hop understanding of the network latency encountered. Therefore, latency data collected at the mapping level can be transformed into meaningful application-layer insights, enabling clients or observability tools to more accurately analyze and respond to network performance issues.

[0066] Packet latency data can be stored in a kernel-level data structure (specifically, a mapping) that associates the identification information of a specific network flow with critical latency measurements recorded at the Transport Layer Balancer (TLB). This data structure operates using a key-value paradigm. The key consists of the source IP address and source port of the internal packet, uniquely identifying a given flow or connection attempt. The corresponding value includes the tunnel source IP and the measured TLB duration. The tunnel source IP identifies the specific TLB host forwarding the packet, while the TLB duration quantifies the time the packet spends traversing the TLB node. By pairing these elements, packet latency data not only records the time a packet spends in a critical network component (TLB) but also associates that time with a specific source endpoint and the TLB nodes the packet traversed.
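
A minimal sketch of this key-value layout is shown below, with an ordinary Python dictionary standing in for the kernel-level mapping. The names `FlowKey`, `TlbEntry`, `record_flow`, and `lookup_flow`, and the microsecond unit, are assumptions made for illustration.

```python
from typing import Dict, NamedTuple, Optional, Tuple

# Key: (source IP, source port) of the internal packet.
FlowKey = Tuple[str, int]


class TlbEntry(NamedTuple):
    tunnel_src_ip: str    # identifies the TLB host that forwarded the packet
    tlb_duration_us: int  # time spent traversing the TLB (unit is assumed)


# A plain dict stands in for the kernel-level mapping in this sketch.
latency_map: Dict[FlowKey, TlbEntry] = {}


def record_flow(src_ip: str, src_port: int,
                tunnel_src_ip: str, tlb_duration_us: int) -> None:
    """Associate a flow with the TLB host and the measured TLB duration."""
    latency_map[(src_ip, src_port)] = TlbEntry(tunnel_src_ip, tlb_duration_us)


def lookup_flow(src_ip: str, src_port: int) -> Optional[TlbEntry]:
    """Return the entry for a flow, or None if the flow was never recorded."""
    return latency_map.get((src_ip, src_port))
```

Keying on the source IP and source port means that two connection attempts from the same client on different ports are tracked as separate flows.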

[0067] Packet latency data provides fine-grained insights into per-hop latency and network behavior. It enables downstream processing components (such as Envoy agents or other monitoring tools) to retrieve, analyze, and annotate network flows with precise latency information. This allows operators and automated systems to isolate slow network segments, identify performance bottlenecks, and gain detailed hop-level understanding of the transmission paths of packets through the cloud mesh.

[0068] Server 150 processes client requests and performs backend functions based on the received data. Server 150 processes incoming requests including packet payloads and then connects to Server Envoy 152 and Server Application 154. Server 150 provides backend services and data processing in response to client requests. Server Envoy 152 is a server-level Envoy proxy. Server Application 154 resides at the application layer, where client requests are processed according to business logic. Server Application 154 processes request and response payloads and communicates with Server 150 and Server Envoy 152. Server Application 154 performs core operations to generate a response to be sent back to Client 120. On the return route to the client, App GW Envoy 148 adds a final trace header and timestamp to outgoing packets, providing full visibility into latency and network path.

[0069] Log 160 stores detailed records of network events, metrics, and errors for subsequent analysis. Log 160 data includes timestamped logs that capture network activity, packet metadata, and latency measurements. Fault Analyzer 180 can access Log 160. Log 160 collects and organizes logs, providing a historical record of network performance and events. Time Series Database (TSDB) 170 stores timestamped network data (e.g., service metrics from TLB 142 and ingress GW 146), providing a structured dataset for latency and performance analysis. TSDB 170 data includes time-series metrics, including latency, packet flow, and error rates. TSDB 170 aggregates metrics over a period of time, enabling efficient retrieval for analysis, visualization, and historical comparison. TSDB 170 works with Fault Analyzer 180 to support alert generation.

[0070] The fault analyzer 180 identifies performance issues by analyzing logs and metrics from the TSDB. The fault analyzer 180 processes time-series data, log entries, and fault metrics. Specifically, the fault analyzer 180 receives data from logs 160 and TSDB 170 and generates alerts based on detected network performance anomalies, for example when latency exceeds a threshold, enabling proactive management and troubleshooting.
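
The threshold-based alerting described above can be illustrated with a short sketch. The function name `detect_latency_alerts`, the hop labels, and the per-hop threshold table are hypothetical; the source does not specify how thresholds are configured or how alerts are formatted.

```python
from typing import Dict, Iterable, List, Tuple


def detect_latency_alerts(samples: Iterable[Tuple[str, float]],
                          thresholds_ms: Dict[str, float]) -> List[str]:
    """Return one alert message per sample whose latency exceeds its hop's threshold.

    samples: (hop name, latency in ms) pairs, e.g. read from a TSDB.
    thresholds_ms: per-hop latency limits; hops without a limit are ignored.
    """
    alerts = []
    for hop, latency_ms in samples:
        limit = thresholds_ms.get(hop)
        if limit is not None and latency_ms > limit:
            alerts.append(f"{hop}: {latency_ms:.1f} ms exceeds {limit:.1f} ms")
    return alerts
```

With a 50 ms limit for the TLB and a 10 ms limit for the ingress GW, a 100 ms TLB sample produces an alert while a 3 ms ingress GW sample does not.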

[0071] This layered setup creates a framework for monitoring, analyzing, and optimizing network performance across the entire complex service infrastructure. From initial client requests to backend processing and performance analytics, each component plays a distinct role in ensuring reliable, high-performance network operations.

[0072] Referring to Figure 2B, a schematic diagram 100B is shown associated with a network management engine provided according to embodiments described herein. The process is initiated by client 120, which sends a network request to a specific service. This request traverses multiple layers and components, each carefully configured to optimize network performance and enhance visibility into network state. Each component in this architecture is integrated to leverage advanced mechanisms such as BGP 130 and packet interceptors across different nodes for data collection, latency analysis, and fault detection.

[0073] Upon initiation, client 120 requests are routed to App GW 140, which processes the incoming traffic and distributes it evenly across available servers in the system. App GW 140 acts as the entry point for routing packets to appropriate backend services. Next, the request is forwarded to TLB 142, which is responsible for distributing network traffic across multiple application instances to ensure optimal load balancing and low latency. For example, when an HTTP packet enters App GW 140, it first passes through the IP layer and is encapsulated in TCP (Transmission Control Protocol) for reliable delivery. The packet then passes through XDP (eXpress Data Path, the eBPF fast data path), where filtering or redirection can be performed for high-performance packet processing. From there, the packet is handed over to the Traffic Control (TC) system, which can enforce policies such as bandwidth shaping or prioritization before forwarding the packet to the IPVS (IP Virtual Server) load balancer, which distributes traffic to appropriate backend servers based on the configured load balancing method. Throughout this path, the payload (application data) remains constant, and the total processing time is minimized by efficiently processing each network layer.

[0074] Within TLB 142, eBPF programs 142A and 142B are deployed to monitor packet flows at both the ingress and egress points. These eBPF programs are kernel-based technologies designed for packet filtering and monitoring in high-speed environments. eBPF 142A is attached to the ingress traffic path to timestamp each incoming packet, creating a baseline for measuring latency. eBPF 142B is then attached to the egress traffic path to measure the time packets spend within the TLB node, providing an initial measurement of network latency.

[0075] Traffic flows from the TLB through tunnel 144, which links the TLB to the ingress gateway (GW) 146. This tunnel is designed to protect data transmission between the TLB and the ingress gateway, ensuring that traffic remains isolated and optimized as it traverses the network infrastructure. Packets arriving at the ingress GW 146 after passing through the tunnel are encapsulated in multiple layers. Each packet begins with an IP header, followed by another IP header (for tunneling), then the duration (DU), a TCP segment, and its corresponding payload (application data). The entire packet is forwarded through the tunnel, where the external IP header is used to route it to the ingress gateway, which decapsulates the packet and forwards it to its destination for further processing.

[0076] At the ingress GW 146, another eBPF program (i.e., eBPF 146A) is attached to track packets as they enter and leave the gateway. This eBPF program is configured to monitor key metrics such as packet arrival time, source IP, and transmission duration, adding another layer of latency monitoring within the gateway node. Packets passing through ingress GW 146 are first processed at the XDP (eXpress Data Path, the eBPF fast data path) level, where they are filtered or redirected for high-performance processing before reaching the kernel's network stack. From there, the packets pass through Traffic Control (TC), which applies policies such as traffic shaping or prioritization to manage the flow and ensure Quality of Service (QoS). The packets are then routed at the IP layer (where the packet's destination is determined) and encapsulated where necessary to tunnel or forward them to the appropriate backend.

[0077] Finally, Envoy, acting as a service proxy (i.e., App GW Envoy 148), can process the packet at the application layer, performing tasks such as load balancing and routing, and applying any additional network policies before the packet reaches its intended service or application. Envoy operates by logging the source IP address, source port, and destination IP address associated with each new TCP connection. To achieve this, an Envoy network filter is introduced and executed whenever a new TCP connection is established.

[0078] During initialization, the network filter queries the kernel to retrieve the TLB IP address and TLB duration associated with the {source IP, source port} key. Once this information is obtained, it is stored in memory for reference during response processing. After the lookup, the corresponding kernel entry is immediately deleted to maintain a clean state.

[0079] To prevent the accumulation of outdated data, the garbage collector within Envoy periodically removes obsolete mapping entries. For example, the garbage collector can run every two minutes and remove entries older than 90 seconds. This approach ensures efficient memory usage and prevents potential memory leaks.
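
The lookup-then-delete step and the periodic cleanup can be sketched as follows. This is a simplified model, not Envoy filter code: the `ConnectionCache` class and its method names are hypothetical, and a plain dictionary stands in for the kernel mapping. The 90-second age limit and two-minute interval are taken from the example values in the text.

```python
from typing import Dict, Optional, Tuple

FlowKey = Tuple[str, int]   # (source IP, source port)
MAX_AGE_S = 90.0            # entries older than this are removed
GC_INTERVAL_S = 120.0       # the caller is expected to run the collector this often


class ConnectionCache:
    """In-memory store filled when a new TCP connection is seen (illustrative)."""

    def __init__(self) -> None:
        # key -> (time stored, TLB IP, TLB duration)
        self._entries: Dict[FlowKey, Tuple[float, str, int]] = {}

    def on_new_connection(self, kernel_map: dict, key: FlowKey, now: float) -> bool:
        """Copy the kernel entry into memory and delete it from the kernel map."""
        value = kernel_map.pop(key, None)  # pop() reads and deletes in one step
        if value is None:
            return False
        tlb_ip, tlb_duration = value
        self._entries[key] = (now, tlb_ip, tlb_duration)
        return True

    def get(self, key: FlowKey) -> Optional[Tuple[str, int]]:
        """Return (TLB IP, TLB duration) for use during response processing."""
        entry = self._entries.get(key)
        return None if entry is None else entry[1:]

    def collect_garbage(self, now: float) -> int:
        """Remove entries older than MAX_AGE_S; return how many were removed."""
        stale = [k for k, (stored, _, _) in self._entries.items()
                 if now - stored > MAX_AGE_S]
        for k in stale:
            del self._entries[k]
        return len(stale)
```

Because the collector interval is longer than the age limit, an entry stored just after one collection pass is guaranteed to be stale by the time the next pass runs.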

[0080] Returning to Figure 2A, when a packet arrives at server Envoy 152, the Envoy proxy inspects the packet before it reaches server application 154. Here, the actual server application 154 processes the client's request, thus handling operations specific to the requested service. After processing, the server application's response is routed back through server Envoy 152 and then sequentially through App GW Envoy 148 and ingress GW 146 to client 120.

[0081] The client extracts and calculates network latency data. The client can use path analysis graphs to update the log database to support alert generation. A path analysis graph is a visual representation that depicts the different paths packets take in the network and provides detailed insights into latency and performance metrics at each hop. This type of graph is particularly valuable in complex network environments where packets may traverse multiple routers, switches, and gateways, each of which may affect overall latency based on congestion, routing decisions, or Quality of Service (QoS) configurations.

[0082] When constructing a path analysis graph, data is collected at key network nodes, including entry points, intermediate routers, gateways, and destinations. Each node in the graph represents a point where metrics such as latency, packet loss rate, and jitter are measured. These nodes help pinpoint the precise location of latency or performance degradation as packets travel along their routes. For example, if packets consistently experience high latency at a particular router, the path analysis graph will highlight this, allowing network administrators to focus on resolving the issue at that node rather than across a broader network segment.
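
As a rough illustration of how per-hop measurements might be arranged into such a graph, the sketch below models a single linear path. The function names and the dictionary-based node layout are assumptions; a real path analysis graph would also carry packet loss, jitter, and alternative routes.

```python
from typing import Dict, List, Tuple


def build_path_graph(hops: List[Tuple[str, float]]) -> Dict[str, dict]:
    """Build a linear path graph: node name -> {"latency_ms", "next"}.

    hops: ordered (node name, measured latency in ms) pairs along one path.
    """
    graph = {}
    for i, (name, latency_ms) in enumerate(hops):
        next_name = hops[i + 1][0] if i + 1 < len(hops) else None
        graph[name] = {"latency_ms": latency_ms, "next": next_name}
    return graph


def slowest_hop(graph: Dict[str, dict]) -> str:
    """Name of the node contributing the most latency on the path."""
    return max(graph, key=lambda name: graph[name]["latency_ms"])


def total_latency_ms(graph: Dict[str, dict]) -> float:
    """End-to-end latency as the sum of per-hop latencies."""
    return sum(node["latency_ms"] for node in graph.values())
```

Given hops of 2 ms at the App GW, 40 ms at the TLB, and 5 ms at the ingress GW, the sketch identifies the TLB as the node to investigate.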

[0083] The path analysis graph also includes data on dynamic route changes and different service patterns, showing how packets might traverse different routes under varying conditions. This real-time tracking allows for precise latency calculations on each path, enabling administrators to clearly understand where congestion or anomalies are impacting network performance. Where tools rely on aggregated data, the path analysis graph adds granularity by capturing detailed metrics at each network hop, allowing for the detection and resolution of subtle anomalies that might otherwise be overlooked.

[0084] In network management, path analysis graphs provide visibility into the network's operational status, thereby correlating service flows with node-level performance data. By identifying potential problem areas at each stage of packet transmission, path analysis graphs support more efficient troubleshooting, faster resolution times, and optimized routing paths, thereby improving overall network reliability and responsiveness.

[0085] Turning to node-level service metrics, these refer to a set of specific measurements and performance indicators collected at each network node, which together provide fine-grained insights into data flow, resource utilization, and latency behavior. These metrics offer a localized view of the health, efficiency, and operational status of nodes within a broader distributed network environment. By collecting these node-level metrics, the system achieves precise, real-time visibility into service patterns and resource constraints that impact end-to-end network performance.

[0086] Within the network management engine, each node (e.g., a transport layer balancer, ingress gateway, or application gateway) generates and tracks metrics associated with its own activity, resource usage, and service processing. For example, packet loss at a node's NIC or within its kernel highlights where data packets may fail to be delivered, providing clues to potential congestion or capacity issues. CPU and memory utilization metrics provide a view of processing demand and availability, facilitating efficient resource allocation management. These metrics directly support the solution's goal of enabling real-time latency analysis and timely adjustments to optimize packet routing and processing.

[0087] Furthermore, metrics such as round-trip time (RTT) histograms, congestion-limited connection counts, and open TCP / UDP port counts provide detailed insights into connection health and stability. By correlating these node-level service metrics with data stored in components such as time-series databases, the solution can perform advanced analytics to detect anomalies and identify deviations from expected performance, thereby improving the accuracy and responsiveness of network management. Ultimately, node-level service metrics serve as the foundational data supporting latency monitoring, fault detection, and automated network tuning in cloud mesh networks, enabling the solution to focus on precise path-specific service control and fault resolution.

[0088] For example, node-level service metrics provide insights into the health and performance of each network node, focusing on factors such as resource utilization, connection stability, and data flow efficiency. Each metric is crucial for diagnosing network problems and optimizing service management, especially in complex systems where latency and resource constraints can impact overall performance.

[0089] Packet loss in the network interface card (NIC) and kernel reveals potential bottlenecks or interruptions in data transmission. For example, if a node displays a high packet loss rate at the NIC, it may indicate a buffer capacity issue or link overload, requiring careful examination of service routing or hardware capacity. Within the kernel, packet loss may stem from processing capacity limitations (packets being dropped before being forwarded), fundamentally impacting the reliability of data transmission.

[0090] The count of CPU cores exceeding a predefined utilization percentage (e.g., 75% utilization) within a one-second window is another important metric. This metric allows for real-time insight into processing demand across the entire node, highlighting when specific tasks or traffic spikes overload resources. For example, if multiple CPU cores consistently exceed 75% utilization during peak periods, the node may require load balancing or further resource allocation to prevent performance degradation.
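
This metric reduces to a simple count over per-core readings for one window. In the sketch below, the function name and the input format (one utilization percentage per core, already averaged over the one-second window) are assumptions.

```python
from typing import Sequence


def cores_over_threshold(per_core_utilization_pct: Sequence[float],
                         threshold_pct: float = 75.0) -> int:
    """Count cores whose utilization in the window is strictly above the threshold."""
    return sum(1 for u in per_core_utilization_pct if u > threshold_pct)
```

A core sitting exactly at the threshold is not counted here; whether the comparison should be inclusive is a design choice the source leaves open.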

[0091] Congestion-limited connections are those that cannot operate at full capacity due to network congestion. Tracking the number of these connections can reveal when and where data services encounter bandwidth limitations, providing a basis for adjusting QoS settings or rerouting services to less congested paths.

[0092] Round-trip time (RTT) histograms help identify latency patterns and anomalies. For example, a normal distribution of RTT can indicate stable performance, but spikes or offsets in the histogram indicate fluctuating latency—potentially pointing to routing path problems or transient congestion. Monitoring RTT histograms allows for quick identification of latency variations, which is crucial for maintaining seamless, low-latency connectivity.
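
One way to build such a histogram is to count samples into fixed buckets, as in the sketch below. The bucket edges are illustrative values chosen for the example, not taken from the source.

```python
from bisect import bisect_right
from typing import List, Sequence


def rtt_histogram(rtts_ms: Sequence[float],
                  bucket_edges_ms: Sequence[float]) -> List[int]:
    """Count RTT samples per bucket.

    bucket_edges_ms must be sorted ascending. Bucket 0 holds samples below the
    first edge, bucket i holds edges[i-1] <= rtt < edges[i], and the last
    bucket holds samples at or above the final edge.
    """
    counts = [0] * (len(bucket_edges_ms) + 1)
    for rtt in rtts_ms:
        counts[bisect_right(bucket_edges_ms, rtt)] += 1
    return counts
```

A growing count in the final overflow bucket is the kind of shift in the distribution that would point to transient congestion or a routing path problem.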

[0093] Metrics counting the connections whose Receive Memory (RMEM) and Write Memory (WMEM) usage exceeds configured thresholds help identify situations where data buffering exceeds the node's memory capacity. Frequently exceeding RMEM or WMEM thresholds may be an indication of insufficient buffer settings or abnormally high data rates, requiring optimization to maintain data integrity and prevent connection slowdowns.

[0094] Packet and bit rate measures the amount of data passing through a node over a period of time, allowing administrators to track overall throughput and identify unusual spikes or drops in the data flow. For example, a sudden drop in a node's packet rate may indicate an upstream routing problem, packet filtering issue, or application failure, prompting an immediate investigation.

[0095] The count of open TCP and UDP ports on a node indicates which services are active and accessible. This information is crucial for maintaining network security and efficiency, as unmonitored open ports can expose the system to unauthorized access or increased load from external sources.

[0096] Node memory utilization provides a view of available and consumed memory, helping to assess whether memory resources are sufficient for the current task. High, poorly managed memory utilization can lead to paging, ultimately slowing down data processing and packet processing.

[0097] Memory bandwidth utilization represents the utilization rate of memory access capacity, which is particularly relevant in data-intensive applications. If memory bandwidth is fully utilized, even if CPU and network resource utilization is low, it can slow down access to critical data, thus creating a bottleneck.

[0098] Finally, tracking the percentage of CPU consumed by eBPF on the host highlights the processing demands of the eBPF programs that manage packet processing and monitoring. For example, if an eBPF program is consuming a large amount of CPU resources, it may limit the resources of other processes on the node, indicating a need to optimize the efficiency of these monitoring functions.

[0099] These node-level service metrics collectively provide a comprehensive view of the network health and performance at each node. By continuously monitoring and analyzing these metrics, network administrators gain the tools to diagnose, tune, and optimize node performance under various service conditions and operational requirements.

[0100] For example, in terms of bandwidth utilization, determining whether available bandwidth has been exhausted involves correlating increases in packet latency with increases in traffic load, as observed through multiple packet interceptors deployed throughout the network. As more data flows through the network, traffic patterns become more intensive, and packets begin to contend for the same resources, such as queues and transmission buffers. Packet interceptors record this increasingly intense contention as a gradual increase in latency. For example, when traffic exceeds a certain threshold, packets may begin to queue at transport layer balancers or experience slower forwarding rates at ingress gateways. These conditions are reflected in timestamped packet data: previously negligible latency increases significantly, whether at individual nodes or cumulatively along the multi-hop path of the packets. By continuously comparing current latency readings to established performance benchmarks, the system can detect subtle changes in congestion. Once the interceptors report a sustained increase in latency at multiple points in the network (specifically, points referred to as capacity-limited segments), the system infers that available bandwidth has been effectively consumed.
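
The inference described above can be approximated by comparing recent readings against a baseline at each measurement point. The sketch below is one possible heuristic under stated assumptions: the function names, the multiplier `factor`, the number of consecutive readings, and the minimum number of affected points are all hypothetical parameters, not values from the source.

```python
from typing import Dict, List


def saturated_points(readings_ms: Dict[str, List[float]],
                     baseline_ms: Dict[str, float],
                     factor: float = 2.0,
                     min_consecutive: int = 3) -> List[str]:
    """Points whose last `min_consecutive` readings all exceed factor * baseline."""
    flagged = []
    for point, series in readings_ms.items():
        recent = series[-min_consecutive:]
        if len(recent) == min_consecutive and all(
                value > factor * baseline_ms[point] for value in recent):
            flagged.append(point)
    return flagged


def bandwidth_exhausted(readings_ms: Dict[str, List[float]],
                        baseline_ms: Dict[str, float],
                        min_points: int = 2) -> bool:
    """Infer exhaustion when a sustained increase is seen at several points."""
    return len(saturated_points(readings_ms, baseline_ms)) >= min_points
```

Requiring several consecutive elevated readings at more than one point is what separates a sustained, network-wide increase from a single transient spike.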

[0101] Throughout the network management process, logs 160 are continuously updated with information about packet flows, latency measurements, and processing durations at each network node. These logs serve as a persistent repository of network activity, crucial for troubleshooting and retrospective analysis. To manage and analyze large volumes of real-time data, TSDB 170 is utilized. This dedicated database is optimized for timestamped data, capturing metrics from each hop in the request path, including timestamps, hop-specific latency, and error rates. TSDB aggregates this information to provide an end-to-end view of network performance.

[0102] Finally, the fault analyzer 180 processes data from the logs and TSDB 170 to detect deviations from normal performance benchmarks. If latency spikes or packet loss exceed acceptable thresholds, the fault analyzer generates alerts to prompt network administrators to investigate and, if necessary, remedy the problem. By analyzing trends and identifying anomalies, the fault analyzer acts as a proactive monitoring and control layer, ensuring network reliability and efficiency.

[0103] In this way, the Network Management Engine 110 integrates advanced components and technologies, providing a comprehensive solution for network management, performance monitoring, and fault detection. This architecture enables precise latency measurement at each network node, provides end-to-end visibility, and allows for rapid response to network anomalies, thereby ensuring optimal performance and reliability of client-server interactions.

[0104] Referring to Figure 2C, a flowchart associated with the Network Packet Management Extension Engine (NTLE) is shown. This engine enhances visibility into the mesh network by tracking request paths and hop-level latency, enabling rapid troubleshooting and potentially automated recovery for performance optimization. The NTLE provides a traceable path for request processing, monitoring latency at each node and enabling automated, real-time responses to optimize network performance and maintain service availability. The following steps can be used to illustrate path identification and TLB duration measurement:

[0105] At step 201C: Client Request Processing and Ingress at the TLB: Client request processing begins when a client (or service generator) initiates a request for a Virtual IP (VIP) associated with a specific service. This VIP is a unique IP that acts as a single entry point, guiding the request through the mesh of network components, facilitating routing, monitoring, and latency tracing. The first component receiving the request is the Transport Layer Balancer (TLB), which handles routing at the network edge to direct traffic to the appropriate service.

[0106] Deploying eBPF programs on TLB hosts: To enhance tracing and optimize latency, several eBPF programs are deployed on the TLB host. eBPF allows small programs to run in the kernel to monitor and analyze packet data, thereby minimizing performance overhead by executing only when specific traffic conditions are met.

[0107] eBPF 1 is attached to the Traffic Control (TC) hook at the ingress (the entry point of the TLB). Its function is to timestamp incoming SYN packets from the IP associated with the TLB's VIP. This timestamp helps track the exact arrival time of packets when they reach the TLB, providing a reference point for latency analysis. To optimize performance, eBPF 1 uses a per-CPU sampling method, ensuring that each CPU core processes only one SYN packet per second, reducing the load on the system.
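
The per-CPU sampling rule can be modeled in a few lines of Python. This sketch only illustrates the rate-limiting logic; it is not eBPF code, and the `PerCpuSynSampler` class and its method name are hypothetical.

```python
from typing import Dict

SAMPLE_INTERVAL_NS = 1_000_000_000  # one second, per the sampling rule in the text


class PerCpuSynSampler:
    """Allow at most one timestamped SYN packet per CPU per second (illustrative)."""

    def __init__(self) -> None:
        # CPU index -> time (ns) of the last SYN packet that was sampled on it
        self._last_sample_ns: Dict[int, int] = {}

    def should_timestamp(self, cpu: int, now_ns: int) -> bool:
        """Return True if this SYN packet should be timestamped on this CPU."""
        last = self._last_sample_ns.get(cpu)
        if last is not None and now_ns - last < SAMPLE_INTERVAL_NS:
            return False  # this CPU already sampled a SYN within the last second
        self._last_sample_ns[cpu] = now_ns
        return True
```

Keeping the state per CPU avoids contention between cores: each core consults only its own last-sample time, so no cross-core locking is needed in this model.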

[0108] eBPF 2 is also attached to the TC hook, but at the egress (the exit point of the TLB). It measures the time a packet spends within the TLB from entry to exit, including any time spent creating a network tunnel. The time spent within the TLB is calculated and written as the TLB duration to the packet's internal IP options, thus providing an accurate measure of TLB processing latency.

[0109] Inspection and logging of TLB data: eBPF 3 examines tunneled packets and logs relevant metadata such as source IP, source port, TLB IP address, and TLB duration. This data is crucial for understanding where latency might be introduced during packet processing. The program is attached to the TC hook in the ingress gateway (GW) pod to log metadata for each packet, which will be referenced in subsequent connection analysis.

[0110] At step 202C – Ingress Gateway (GW) Processing and Connection Tracking: When a request arrives at the ingress GW, it is processed by App GW Envoy (a high-performance proxy service that routes and monitors network traffic). To track and report latency, network filters in Envoy collect the necessary metadata.

[0111] App GW Envoy integration for connection tracking: Network filters in App GW Envoy capture source IP, source port, and TLB information (e.g., TLB processing duration, TLB host IP). This information, initially logged by the eBPF program at the TLB, is now accessible to each connection, allowing Envoy to assess the processing time of a request as it traverses each layer of the network. A garbage collector function clears old entries from memory every two minutes to prevent data overflow, ensuring that only relevant, recent data is stored.
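The connection-tracking store and its periodic cleanup might look like the following Python sketch. The names `ConnectionTracker` and `TlbInfo` are hypothetical, and treating "old" as "stored more than 120 seconds ago" is an assumption; the text only states that cleanup runs every two minutes.

```python
from typing import Dict, NamedTuple, Optional, Tuple

MAX_AGE_SECONDS = 120.0  # assumed to match the two-minute cleanup interval


class TlbInfo(NamedTuple):
    tlb_host_ip: str
    tlb_duration_ms: float
    stored_at: float  # seconds, from whatever clock the caller uses


class ConnectionTracker:
    """Hypothetical in-memory store keyed by (source IP, source port)."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, int], TlbInfo] = {}

    def record(self, src_ip: str, src_port: int, tlb_host_ip: str,
               tlb_duration_ms: float, now: float) -> None:
        self._entries[(src_ip, src_port)] = TlbInfo(
            tlb_host_ip, tlb_duration_ms, now)

    def lookup(self, src_ip: str, src_port: int) -> Optional[TlbInfo]:
        return self._entries.get((src_ip, src_port))

    def collect_garbage(self, now: float) -> int:
        """Drop entries older than MAX_AGE_SECONDS; return how many were removed."""
        stale = [key for key, info in self._entries.items()
                 if now - info.stored_at > MAX_AGE_SECONDS]
        for key in stale:
            del self._entries[key]
        return len(stale)
```

A caller would invoke `collect_garbage` from a timer; passing the clock value in explicitly keeps the sketch easy to test.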

[0112] App GW Envoy's custom response filter: App GW Envoy also adds specific tracing headers to each HTTP response to capture detailed processing information. This response filter is deployed within the ingress GW pod; it logs request latency at several points and attaches metadata for each layer the packet traverses:

[0113] X-CORP-MESH-PROXY-DURATION: Indicates the processing time within the Envoy proxy.

[0114] X-CORP-MESH-PROXY-POD: Identifies proxy pods by IP or FQDN, thus allowing pod-level tracing.

[0115] Additional headers:

[0116] X-CORP-MESH-INGRSS-GW-DURATION: Indicates the total processing time within the ingress GW.

[0117] X-CORP-MESH-INGRSS-GW-POD: Identifies the ingress GW pod.

[0118] X-CORP-MESH-TLB-HOST: Captures the TLB host's source IP, retrieved from local storage.

[0119] X-CORP-MESH-TLB-DURATION: Reflects the time spent by the TLB processing the request.
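As an illustration of the headers listed above, the following Python sketch assembles the four ingress GW headers into a dictionary. The header names come from the text; the function name `build_ingress_gw_headers` and the value format (milliseconds with three decimals) are assumptions, since the text does not specify how values are encoded.

```python
from typing import Dict


def build_ingress_gw_headers(gw_duration_ms: float, gw_pod: str,
                             tlb_host: str,
                             tlb_duration_ms: float) -> Dict[str, str]:
    """Return the tracing headers an ingress GW response filter might attach."""
    return {
        # Durations are rendered as millisecond strings (assumed format).
        "X-CORP-MESH-INGRSS-GW-DURATION": f"{gw_duration_ms:.3f}",
        "X-CORP-MESH-INGRSS-GW-POD": gw_pod,
        "X-CORP-MESH-TLB-HOST": tlb_host,
        "X-CORP-MESH-TLB-DURATION": f"{tlb_duration_ms:.3f}",
    }


if __name__ == "__main__":
    for name, value in build_ingress_gw_headers(
            1.5, "gw-pod-1", "192.0.2.10", 0.25).items():
        print(f"{name}: {value}")
```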

[0120] At step 203C—Latency Calculation and Graph Generation at the Client: Upon receiving the HTTP response, the client extracts the trace headers added by the App GW Envoy in the ingress GW. These headers provide a complete description of the latency at various network points:

[0121] The client captures headers such as X-CORP-MESH-PROXY-DURATION and X-CORP-MESH-TLB-DURATION, each of which indicates latency at critical nodes (e.g., TLB, ingress gateway, Envoy proxy).

[0122] The client records this data as a JSON object, thus allowing for a structured representation of the network path and latency for each node:

[0123] {

[0124] "nodes":[

[0125] {"id":1, "label":"client"},

[0126] {"id":2, "label":"tlb"},

[0127] {"id":3, "label":"ingress-gw"},

[0128] {"id":4, "label":"envoy-proxy"},

[0129] {"id":5, "label":"svc-endpoint"}

[0130] ],

[0131] "edges":[

[0132] {"id":1, "label":"client_to_tlb", "latency_in_millis":<CtoTlb(t)>},

[0133] {"id":2, "label":"tlb_to_ingrss_gw", "latency_in_millis":<TlbToIgw(t)>},

[0134] {"id":3, "label":"ingrss_gw_to_envoy_proxy", "latency_in_millis":<IgwToProxy(t)>},

[0135] {"id":4, "label":"envoy_proxy_to_svc_endpoint", "latency_in_millis":<ProxyToSvc(t)>} ] }

[0137] At step 204C, Automated Analysis and Remediation: Tools (e.g., fault analyzers) analyze the JSON latency data and correlate it with kernel metrics to identify bottlenecks. If network congestion, resource exhaustion, or unusual latency spikes are detected, these tools can initiate adjustments via a VIP scheduler that dynamically reroutes traffic and allocates resources to improve network performance and ensure reliability. For efficient operation within the network, root access for the App GW Envoy may be enabled to support BPF capabilities. If security policy restricts root access, an alternative is to implement a Remote Procedure Call (RPC) proxy; however, this could significantly increase latency due to the additional routing overhead.

[0138] To measure network latency without affecting packet flow, the duration is carried in the IP options using the timestamp IP option, a feature specified in the IP standard for experimental purposes. This option is chosen because it suits latency measurement and is rarely used in typical network configurations. Attaching it only to TCP SYN packets minimizes the added overhead, and the method avoids the costly adjustment of packet header space, which can slow down high-speed packet processing.

[0139] Timestamps are applied only to traffic destined for a Virtual IP (VIP) and moving between the Transport Layer Balancer (TLB) and the Ingress Gateway (GW). This selective approach keeps the IP option hidden from other network devices along the path, avoiding unnecessary exposure while preserving the optimization. Although including timestamps introduces some lookup cost within the kernel, this cost can be minimized by using a per-CPU hash map that allows lock-free lookups. This architecture reduces bottlenecks, enabling efficient latency tracking.

[0140] Within this framework, the design measures packet latency from the host to the customer namespace, through the IPVS load balancer, and down to the egress point in the Linux kernel. The timestamp IP option is then removed at the lowest level of the Linux network stack at the ingress GW to prevent it from appearing in responses to clients. Without this removal step, the IP option would be visible to every device along the packet's network path, potentially increasing latency because some devices handle IPv4 options on a slower path. Timestamps can also increase packet size, risking fragmentation and further latency. To prevent this, timestamps are stripped at the Traffic Control (TC) hook in the customer network namespace, an efficient choice that avoids the significant performance cost of attaching an eXpress Data Path (XDP) program to the virtual Ethernet (veth) interface in the customer namespace.

[0141] Further performance optimization is achieved through per-CPU sampling, which reduces the timestamp update frequency between the TLB and the ingress GW by a factor of 16. This significantly reduces the processing load on the system without compromising the accuracy of latency tracking. Through this combination of targeted timestamp usage, efficient data processing, and strategic processing points, the design achieves detailed latency measurement while maintaining high network performance.
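To make the use of the timestamp IP option concrete, the Python sketch below encodes and decodes a single 32-bit value in the layout that RFC 791 defines for the Internet Timestamp option (type 68, timestamps-only flag). The patent text does not state the exact byte layout, so the single-slot layout and the function names `encode_timestamp_option` and `decode_timestamp_option` are assumptions.

```python
import struct

IPOPT_TIMESTAMP = 68  # option type of the Internet Timestamp option (RFC 791)
FLAG_TS_ONLY = 0      # flag value 0: timestamps only, no IP addresses
OPTION_LENGTH = 8     # 4 header octets plus one 32-bit timestamp slot


def encode_timestamp_option(value: int) -> bytes:
    """Pack one 32-bit value into a single-slot timestamp option."""
    # The pointer is 1-based and points past the last filled slot, so a full
    # single-slot option has pointer 9 (greater than the length of 8).
    pointer = OPTION_LENGTH + 1
    overflow_and_flags = (0 << 4) | FLAG_TS_ONLY
    return struct.pack("!BBBBI", IPOPT_TIMESTAMP, OPTION_LENGTH, pointer,
                       overflow_and_flags, value & 0xFFFFFFFF)


def decode_timestamp_option(option: bytes) -> int:
    """Return the 32-bit value carried in a single-slot timestamp option."""
    opt_type, length, _pointer, _flags, value = struct.unpack(
        "!BBBBI", option[:OPTION_LENGTH])
    if opt_type != IPOPT_TIMESTAMP or length != OPTION_LENGTH:
        raise ValueError("not a single-slot timestamp option")
    return value


if __name__ == "__main__":
    print(encode_timestamp_option(1234).hex())
```

An 8-octet option is already a multiple of four octets, so no padding is needed to keep the IPv4 header aligned.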

[0142] Referring to Figure 2D, a flowchart associated with a sample implementation of network management is shown, illustrating how eBPF, Envoy filters, and header metadata are used to measure hop-by-hop latency, achieving hop-level visibility for detailed analysis and automated troubleshooting. In operation, a request to access a VIP service is initiated by the client (or a synthetic traffic generator). The request is routed through various components in the Tess cloud mesh, where hop-by-hop latency is measured and metadata is collected to aid troubleshooting and improve visibility.

[0143] At 201D—TLB request processing: Client requests are directed to the Virtual IP (VIP) address associated with the service. The request first reaches one of the TLB (Transport Layer Balancer) nodes. Multiple eBPF (Extended Berkeley Packet Filter) programs are deployed on the TLB host to monitor the request. These programs can be activated only for traffic flowing from the TLB to the ingress GW (Gateway) node, thus minimizing overhead.

[0144] The first eBPF program: At the TLB, the eBPF program attached to the TC (Traffic Control) hook at the ingress is activated. It adds a timestamp IP option to incoming SYN packets directed to the TLB VIP. This timestamp records the exact time the packet entered the TLB host. To reduce processing cost, timestamping is limited to one SYN packet per CPU core per second.

[0145] The second eBPF program, also attached to the TC hook, measures the time a packet spends within the TLB host kernel from entry to exit, including tunnel creation time. The calculated TLB duration is then written to the timestamp field in the IP options of the inner packet, providing a TLB-level latency measurement.

[0146] At step 202D, Data Inspection and Logging at the Ingress GW: After leaving the TLB node, the request is forwarded to the ingress GW node. At the ingress GW, a third eBPF program inspects the packet for data tracking. This program stores the packet's metadata, including the source IP, source port, TLB IP address, and TLB processing duration, in a BPF (Berkeley Packet Filter) map. The key consists of the source IP and port of the inner packet, while the value includes the TLB duration and TLB IP. The network filter in the Envoy proxy retrieves the TLB metadata and stores it in memory, where it is referenced during response processing to produce additional latency metrics.

[0147] At step 203D—Latency Measurement at the Envoy Proxy: The request is processed by the App GW Envoy proxy in the service's application pod. Within the App GW Envoy proxy, further latency information is measured. App GW Envoy adds an HTTP header (X-CORP-MESH-PROXY-DURATION) to capture the time spent by the endpoint service processing the request. Another HTTP header (X-CORP-MESH-PROXY-POD) is added to indicate the App GW Envoy proxy's IP address or fully qualified domain name (FQDN) for traceability.

[0148] At step 204D, Custom Response Filter at the Ingress GW: At the ingress GW, the App GW Envoy response filter records the request duration. This filter uses the TLB metadata stored in memory to populate the following HTTP headers and add them to the response:

[0149] X-CORP-MESH-INGRSS-GW-DURATION: Total duration at the ingress GW.

[0150] X-CORP-MESH-INGRSS-GW-POD: The IP or FQDN of the ingress GW pod.

[0151] X-CORP-MESH-TLB-HOST: Source IP from TLB.

[0152] X-CORP-MESH-TLB-DURATION: Processing duration at the TLB.

[0153] The collected data is stored in a BPF map, where old entries are periodically evicted to prevent memory leaks.

[0154] At step 205D: Client response processing: Upon receiving a response, the client extracts and records the relevant headers to calculate the per-hop latency.

[0155] X-CORP-MESH-PROXY-DURATION: The time spent by the server endpoint.

[0156] X-CORP-MESH-INGRSS-GW-DURATION: Time from ingress GW to service.

[0157] X-CORP-MESH-INGRSS-GW-POD, X-CORP-MESH-PROXY-POD, X-CORP-MESH-TLB-HOST, and X-CORP-MESH-TLB-DURATION are used for tracking purposes.

[0158] Analyzing these headers reveals the per-hop latency and any delays along the path.
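The client-side extraction step can be sketched in Python as follows. The header names are taken from the text; the function name `extract_durations`, the case-insensitive matching, and the assumption that each duration header carries a plain decimal number of milliseconds are illustrative choices rather than details given in the description.

```python
from typing import Dict, Mapping

DURATION_HEADERS = (
    "X-CORP-MESH-PROXY-DURATION",
    "X-CORP-MESH-INGRSS-GW-DURATION",
    "X-CORP-MESH-TLB-DURATION",
)


def extract_durations(headers: Mapping[str, str]) -> Dict[str, float]:
    """Collect the duration headers that are present and parse as numbers."""
    # HTTP header names are case-insensitive, so normalize before matching.
    normalized = {name.upper(): value for name, value in headers.items()}
    durations: Dict[str, float] = {}
    for name in DURATION_HEADERS:
        raw = normalized.get(name)
        if raw is None:
            continue  # header absent from this response
        try:
            durations[name] = float(raw)
        except ValueError:
            continue  # skip malformed values instead of failing the request
    return durations
```

Skipping missing or malformed headers lets the client still record a partial path when one hop did not report a value.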

[0159] Step 206D: Visualization and Tool-Driven Analysis

[0160] Aggregate the recorded metrics into, for example, a JSON document:

[0161] {

[0162] "nodes":[

[0163] {"id":1, "label":"client"},

[0164] {"id":2, "label":"tlb"},

[0165] {"id":3, "label":"ingress-gw"},

[0166] {"id":4, "label":"envoy-proxy"},

[0167] {"id":5, "label":"svc-endpoint"}

[0168] ],

[0169] "edges":[

[0170] {"id":1, "label":"client_to_tlb", "latency_in_millis":<CtoTlb(t)>},

[0171] {"id":2, "label":"tlb_to_ingrss_gw", "latency_in_millis":<TlbToIgw(t)>},

[0172] {"id":3, "label":"ingrss_gw_to_envoy_proxy", "latency_in_millis":<IgwToProxy(t)>},

[0173] {"id":4, "label":"envoy_proxy_to_svc_endpoint", "latency_in_millis":<ProxyToSvc(t)>} ]

[0175] }
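A document of the shape shown above could be produced by a short Python helper such as the one below. The node and edge labels are copied from the example; the function name `build_path_graph` is hypothetical, and the four edge latencies are taken as inputs because the text does not give the formulas behind CtoTlb(t), TlbToIgw(t), IgwToProxy(t), and ProxyToSvc(t).

```python
import json
from typing import Sequence

NODE_LABELS = ["client", "tlb", "ingress-gw", "envoy-proxy", "svc-endpoint"]
EDGE_LABELS = [
    "client_to_tlb",
    "tlb_to_ingrss_gw",
    "ingrss_gw_to_envoy_proxy",
    "envoy_proxy_to_svc_endpoint",
]


def build_path_graph(edge_latencies_ms: Sequence[float]) -> str:
    """Serialize the five-node path with one latency value per edge."""
    if len(edge_latencies_ms) != len(EDGE_LABELS):
        raise ValueError("expected one latency per edge")
    document = {
        "nodes": [{"id": i, "label": label}
                  for i, label in enumerate(NODE_LABELS, start=1)],
        "edges": [{"id": i, "label": label, "latency_in_millis": latency}
                  for i, (label, latency)
                  in enumerate(zip(EDGE_LABELS, edge_latencies_ms), start=1)],
    }
    return json.dumps(document, indent=2)


if __name__ == "__main__":
    print(build_path_graph([2.0, 0.4, 1.1, 3.5]))
```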

[0176] The AI tool can parse the JSON document and combine it with kernel-level metrics to identify bottlenecks or resource limitations. When resource exhaustion or latency spikes are detected in the mesh, the AI tool can trigger automatic remediation actions via the VIP scheduler, ensuring high availability and optimized performance.

[0177] Referring to Figure 3, an end-to-end implementation of the network management engine is shown, providing several key capabilities to enhance network performance and visibility. One of its main features is end-to-end latency tracking, which adds a timestamp option to the IP header, allowing for precise measurement of each segment in the packet path. This capability ensures accurate monitoring of latency for each hop. Another important feature is optimized sampling and filtering. By employing per-CPU sampling, the network management engine reduces the performance impact associated with high-speed packet flows, preserving network efficiency even under heavy traffic conditions.

[0178] Additionally, Envoy network filters at each hop facilitate customized response filtering and analysis. These filters enrich packet headers with latency data, enabling hop-by-hop latency analysis, which is particularly useful for applications. The network management engine continuously logs metrics to a time-series database and leverages a fault analyzer to identify and alert on network issues in near real-time, allowing for rapid intervention when problems occur.

[0179] During operation, client 310 sends a request (Request 1) to TLB 320. The network management engine tags each packet with an IP options timestamp as it arrives at TLB 320. This timestamp serves as the base latency metric, tracking each packet's journey along the path to accurately determine the time intervals between network nodes. The timestamp can be applied only to SYN packets destined for the VIP, minimizing performance overhead.

[0180] Once the request (Request 2) arrives at the ingress GW 330, additional latency data is captured. Here, the network management engine applies the eBPF (Extended Berkeley Packet Filter) procedure at the egress to measure latency and packet traversal time within the ingress GW 330's kernel. The ingress GW 330 also includes a custom Envoy network filter that captures relevant metadata such as TLB processing duration, TLB host IP, and client source IP. To efficiently manage data and prevent memory overload, the ingress GW 330 implements a per-CPU sampling technique, ensuring that only a controlled number of SYN packets are processed per core per second, thereby minimizing processing overhead. The ingress GW 330, utilizing App GW Envoy (not shown), adds a trace header to the packets, thereby capturing hop-specific latency metrics that are visible to client 310 upon receipt.

[0181] At application POD 340, the packet arrives at server Envoy agent 342 (request 3), where the network management engine enables a custom response filter. This filter is configured to include key latency metrics (e.g., X-CORP-MESH-TLB-DURATION and X-CORP-MESH-INGRSS-GW-DURATION) in the header of the outgoing response. These headers allow for precise tracking of the path each packet takes from TLB 320 through ingress GW 330 and back. Server Envoy agent 342 then forwards the packet (request 4) to server app 344, where the packet is processed, and server app 344 returns a response (response 5) to server Envoy agent 342 for the reverse path. On the return path, server Envoy agent 342 sends a response (response 6) from application POD 340 back to ingress GW 330, and a response (response 7) is then sent back to client 310.

[0182] Upon receiving a response at client 310, the rich response headers provide the data needed to calculate hop-by-hop latency. Using these data points, the client generates a detailed analysis path graph that maps each network hop, thus showing the latency between client 310, TLB 320, ingress GW 330, application POD 340, server Envoy agent 342, and server app 344. This path graph is logged in log 350 for historical analysis.

[0183] Throughout each interaction, the TLB 320 and the ingress GW 330 continuously send service metrics to the Time Series Database (TSDB 360), which stores network management engine metrics on packet traversal rate, per-hop latency, and overall network performance. The TSDB ensures accurate timestamp records that allow for long-term trend analysis.

[0184] The Fault Analyzer 370 acts as part of the active monitoring component of the network management engine. It receives data streams from both the Log 350 and TSDB 360, examining them for latency discrepancies, packet loss, or other indicators of network degradation. When an anomaly is detected, the Fault Analyzer 370 generates an alert 380, enabling rapid troubleshooting and adjustments to routing paths or load balancing configurations.
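One simple way to picture the anomaly check is a per-edge threshold comparison, sketched below in Python. The description does not specify how Fault Analyzer 370 detects anomalies, so the threshold rule, the function name `find_latency_anomalies`, and the alert message format are assumptions used only for illustration.

```python
from typing import Dict, List


def find_latency_anomalies(edge_latencies_ms: Dict[str, float],
                           thresholds_ms: Dict[str, float]) -> List[str]:
    """Return one alert message per edge whose latency exceeds its threshold."""
    alerts: List[str] = []
    for edge, latency in sorted(edge_latencies_ms.items()):
        limit = thresholds_ms.get(edge)
        # Edges without a configured threshold are not evaluated.
        if limit is not None and latency > limit:
            alerts.append(f"{edge}: {latency:.1f} ms exceeds {limit:.1f} ms")
    return alerts


if __name__ == "__main__":
    observed = {"client_to_tlb": 12.0, "tlb_to_ingrss_gw": 0.3}
    limits = {"client_to_tlb": 10.0, "tlb_to_ingrss_gw": 1.0}
    for alert in find_latency_anomalies(observed, limits):
        print(alert)
```

A production analyzer would more likely compare against historical baselines from the TSDB; the fixed thresholds here keep the sketch short.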

[0185] Aspects of the technical solution have been described by way of example with reference to Figure 2A, Figure 2B, Figure 2C, Figure 2D, and Figure 3. Figure 2A is a block diagram of an example technical solution environment, described with reference to Figure 6, Figure 7, and Figure 8, in which embodiments of the technical solution may be implemented. Typically, this technical solution environment includes a technical solution system suitable for providing an example cloud computing system 100 that can employ the methods of this disclosure. Specifically, Figure 2A shows a high-level architecture of the cloud computing system 100 according to an embodiment of this disclosure. Among other engines, managers, generators, selectors, or components not shown (collectively referred to herein as "components"), the cloud computing system 100 of Figure 2A corresponds to Figure 1.

[0186] Example Method

[0187] Referring to Figure 4A, Figure 4B, Figure 4C, Figure 5A, Figure 5B, and Figure 5C, flowcharts are provided that illustrate methods for providing network management using a network management engine. These methods can be performed using the cloud computing system described herein. In embodiments, one or more computer storage media include computer-executable or computer-usable instructions that, when executed by one or more processors, can cause the one or more processors to perform methods (e.g., computer-implemented methods) within a cloud computing system (e.g., a computer-implemented system or computer system).

[0188] Turning to Figure 4A, a flowchart illustrating method 400A for providing network management using a network management engine is provided. At block 402A, a first packet interceptor adds a tag to packets that allows packet latency to be calculated. At block 404A, a second packet interceptor uses this tag to calculate the TLB duration, which indicates the packet latency. At block 406A, a third packet interceptor inspects the TLB duration of the packets. At block 408A, the network packet management extension engine stores the TLB duration.

[0189] Turning to Figure 4B, a flowchart illustrating method 400B for providing network management using a network management engine is provided. At block 402B, the network packet management extension engine uses transmitted packets to track packet latency metrics; at block 404B, a tag is added to the packets, which enables packet latency calculation; at block 406B, the TLB duration indicating the packet latency associated with the TLB is calculated; and at block 408B, packet latency data for tunneled packets is examined.

[0190] Turning to Figure 4C, a flowchart illustrating method 400C for providing network management using a network management engine is provided. At block 402C, the network packet management extension engine generates Transport Layer Balancer (TLB) service metrics associated with the first and second packet interceptors; at block 404C, it generates ingress gateway (ingress GW) service metrics associated with the third packet interceptor; at block 406C, the TLB service metrics and ingress GW service metrics are stored in a Time Series Database (TSDB); and at block 408C, the TLB service metrics and ingress GW service metrics are sent from the TSDB to a fault analyzer to generate alerts.

[0191] Turning to Figure 5A, a flowchart illustrating method 500A for providing network management using a network management engine is provided. At block 502A, the client transmits a packet associated with the application gateway and the network packet management extension engine; at block 504A, the client receives a response packet associated with the transmitted packet; at block 506A, the client extracts packet latency data associated with the response packet; and at block 508A, the packet latency data is sent.

[0192] Turning to Figure 5B, a flowchart illustrating method 500B for providing network management using a network management engine is provided. At block 502B, the fault analyzer accesses network performance data associated with the network, including packet latency data generated using multiple packet interceptors associated with the network packet management extension engine; at block 504B, bandwidth utilization associated with predefined thresholds for the network is identified; and at block 506B, an alert associated with the bandwidth utilization is sent.

[0193] Turning to Figure 5C, a flowchart illustrating method 500C for providing network management using a network management engine is provided. At block 502C, a network filter accesses packet latency data associated with the network packet management extension engine; at block 504C, the packet latency data is used to update a response packet associated with a client; and at block 506C, the response packet is sent to the client.

[0194] Network Packet Management Extension Engine

[0195] The network packet management extension engine proposed in this solution addresses the complex challenge of calculating and quantifying packet latency in real time across various network topologies. Unlike conventional systems, this engine is designed to capture accurate packet-level latency metrics by employing a series of packet interceptors strategically distributed throughout the network. These interceptors are located at key nodes (i.e., within the Transport Layer Balancer (TLB) and the ingress gateway (GW)), enabling packet data to be labeled, inspected, and processed in a manner that reveals fine-grained insights into the latency of each packet as it traverses the network.

[0196] Upon entering the network, packets first encounter the packet interceptor located at the traffic control hook of the TLB. At this point, the first packet interceptor adds a timestamp to each packet. This timestamp acts as a crucial reference for calculating latency, capturing the time the packet arrived at the TLB host. By embedding this ingress timestamp, the engine establishes an initial data point for latency calculation, laying the foundation for accurate packet tracking as the packet traverses the network.

[0197] As the packet travels, it reaches a second packet interceptor located at the traffic control hook on the egress network interface of the TLB. The task of this second interceptor is to calculate the TLB duration, a metric specifically associated with the internal processing time of the TLB. The TLB duration is the difference between the current time when the packet arrives at the second interceptor and the packet's ingress timestamp (recorded by the first interceptor). This calculated TLB duration replaces the original timestamp in the packet's data. Including this duration allows the system to accurately quantify the latency occurring during TLB processing, capturing any delays caused by load balancing, routing, or other TLB functions.

[0198] The packet then proceeds to the third packet interceptor located at the ingress gateway, where the engine examines the packet to extract latency data, including the TLB duration. At this point, the packet's latency data is analyzed and cataloged. The extracted latency information, including the inner packet's source IP and port, the tunnel source IP, and the TLB duration, is stored in a map structure. This data structure is organized using key-value pairs, where the key is the unique combination of the inner packet's source IP and port, while the value stores the tunnel source IP and TLB duration. Organizing the data in this way allows latency metrics to be retrieved efficiently, supporting rapid analysis with minimal storage overhead.
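The timestamp-then-duration handoff between the first and second interceptors can be sketched in Python as follows. `TaggedPacket` is a hypothetical stand-in for a packet with a single numeric tag field, and the nanosecond unit and function names are assumptions; the real interceptors operate on packet headers in the kernel.

```python
from dataclasses import dataclass


@dataclass
class TaggedPacket:
    """Hypothetical packet model carrying one numeric tag field."""

    tag_ns: int = 0


def stamp_on_ingress(packet: TaggedPacket, now_ns: int) -> None:
    """First interceptor: record the time the packet entered the TLB."""
    packet.tag_ns = now_ns


def rewrite_on_egress(packet: TaggedPacket, now_ns: int) -> int:
    """Second interceptor: replace the ingress timestamp with the TLB duration."""
    duration_ns = now_ns - packet.tag_ns
    if duration_ns < 0:
        raise ValueError("egress time precedes ingress timestamp")
    packet.tag_ns = duration_ns  # the same field now holds the duration
    return duration_ns


if __name__ == "__main__":
    pkt = TaggedPacket()
    stamp_on_ingress(pkt, now_ns=1_000)
    print(rewrite_on_egress(pkt, now_ns=1_750))
```

Reusing the same field for the duration means the packet does not grow between ingress and egress, which fits the stated goal of avoiding header-space adjustments.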

[0199] Network filters (or network traffic filters) oversee the storage and management of packet latency data within the mapping. These filters update the latency information associated with each packet as needed, ensuring that response packets sent back to the client carry accurate and up-to-date latency metrics. This update mechanism enables the engine to reflect the latest network conditions and packet traversal times, resulting in a responsive and adaptive latency tracking system.

[0200] To support the processing of synchronization (SYN) packets, which is crucial during connection establishment, the engine restricts latency tracking operations to SYN packets. This selective approach minimizes unnecessary processing overhead while maintaining high accuracy in calculating latency for new connections. By focusing on SYN packets, the engine optimizes resource utilization, improving overall network performance without compromising latency visibility.

[0201] The integration of the engine with the tunnel between the TLB and the ingress gateway further improves its efficiency in tracking latency. Timestamping packets as they pass through the tunnel provides a method for calculating tunnel-related latency, capturing the total transmission time of packets along the path. Additionally, the tunnel structure accommodates unique latency data for each packet, preventing the loss of latency metrics even in high-throughput environments.

[0202] Latency data for each packet is stored in the network's time-series database and processed for in-depth analysis and fault detection. By continuously recording latency information, the time-series database allows for long-term trend analysis and historical performance monitoring. Furthermore, this data is fed into a fault analyzer that uses latency trends and discrepancies to detect potential network faults. Through real-time latency tracking and historical data analysis, the fault analyzer can identify deviations from expected network behavior, thereby alerting network administrators to proactively address potential problems.

[0203] This implementation of the network packet management extension engine provides a robust and scalable solution for accurately tracking and quantifying packet latency. By strategically deploying packet interceptors to add timestamps, compute, and store latency data, the engine achieves high visibility into packet flows, enabling real-time insights into network performance and latency characteristics in complex topologies. Network filters, a time-series database, and a fault analyzer component ensure that latency data is not only tracked but also preserved and analyzed to proactively identify network problems, representing a significant advancement in network latency monitoring and management.

[0204] Network performance management engine

[0205] The network performance management engine described in this technical solution provides an advanced, highly responsive framework for identifying and resolving deviations in network behavior, including packet latency in complex topologies. It enables both active clients and synthetic traffic generators to produce packets that reveal fine-grained latency metrics, allowing network operators to dynamically respond to changing network conditions and detect anomalies in node-level metrics.

[0206] At the heart of the network performance management engine, a series of strategically deployed packet interceptors (forming part of the network packet management extension engine) operates throughout the network, capturing packet latency metrics at specific points. The first packet interceptor, located at the traffic control hook of the Transport Layer Balancer (TLB), marks each incoming packet with a timestamp. This timestamp, added as a placeholder within the packet, captures the exact moment the packet enters the TLB, laying the foundation for calculating packet-specific latency metrics. As the packet travels, a second packet interceptor calculates the TLB duration by comparing the ingress timestamp to the current time, measuring the latency that occurs within the TLB during packet processing. The calculated TLB duration is then embedded within the packet, allowing for precise latency tracking and ensuring consistency across different paths and traffic flows. A third packet interceptor, located at the ingress gateway of the application gateway, checks the packet for any additional latency data before forwarding it to its destination, ensuring that latency metrics remain comprehensive and up-to-date.

[0207] The network filter in the network performance management engine facilitates the storage and management of latency data through a mapping structure. In this structure, latency metrics are efficiently organized using unique keys derived from each packet's internal source IP, port, and associated TLB duration. This approach allows the network performance management engine to accurately update and manage packet latency metrics, enabling real-time response packet updates based on constantly changing network conditions. As each client packet traverses the network, the network filter updates the corresponding response packet using relevant latency data extracted from the mapping, providing a complete latency profile when it is returned to the client. This process enables continuous monitoring of network performance at the packet level, revealing fine-grained latency information crucial for real-time anomaly detection and resolution.

[0208] In addition, the network performance management engine generates and updates path analysis graphs, dynamic visual representations of the packet latency data associated with response packets and their routes. Such a graph provides detailed insight into latency across different network segments, allowing network administrators to observe latency patterns, identify deviations from normal performance, and take proactive measures as needed. Using these visual analyses, administrators can see the overall network performance and also pinpoint specific areas where latency exceeds acceptable thresholds. Node-level service metrics reflecting bandwidth utilization at different network nodes complement this graphical data. The network performance management engine uses this data to set predefined bandwidth thresholds; once these thresholds are exceeded, alerts are triggered to notify administrators of potential network congestion or performance degradation, allowing for rapid intervention.

[0209] When packet latency data reveals poor path performance, the network performance management engine dynamically selects alternative paths to optimize packet delivery. By using latency data generated from packet interceptors, the network performance management engine identifies potential alternative paths that reduce latency and improve network performance. This dynamic rerouting mechanism not only ensures optimal path selection but also helps prevent network congestion, thereby improving overall network efficiency and responsiveness in real time.
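The rerouting decision can be pictured with the minimal Python sketch below. The description does not say how alternative paths are ranked, so choosing the lowest-latency path and requiring a minimum gain before switching are assumptions, as are the function name `select_path` and the `min_gain_ms` parameter.

```python
from typing import Dict


def select_path(path_latency_ms: Dict[str, float], current_path: str,
                min_gain_ms: float = 1.0) -> str:
    """Return the path to use, switching only if the gain is large enough."""
    best_path = min(path_latency_ms, key=lambda name: path_latency_ms[name])
    gain_ms = path_latency_ms[current_path] - path_latency_ms[best_path]
    # Require a minimum improvement so small fluctuations do not cause flapping.
    return best_path if gain_ms >= min_gain_ms else current_path


if __name__ == "__main__":
    latencies = {"path-a": 10.0, "path-b": 4.0, "path-c": 9.5}
    print(select_path(latencies, current_path="path-a"))
```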

[0210] The network performance management engine is designed to seamlessly adapt to both active network environments and synthetic traffic conditions. By supporting synthetic traffic generators, the engine can simulate various network patterns and conditions, thereby testing the network's response to dynamic traffic flows and identifying latency issues before they impact real-time services. This allows administrators to evaluate network performance under diverse conditions, providing valuable insights into how the network infrastructure responds to variable loads and identifying potential latency bottlenecks that may otherwise go undetected.

[0211] This technical solution provides a framework for real-time, packet-level network management through its integrated packet interceptor, network filter, and performance monitoring capabilities. By capturing latency data at key points within the network, dynamically updating response packets, and generating detailed path analysis, the engine facilitates comprehensive monitoring of packet flows, enabling proactive anomaly detection and efficient network optimization in dynamic and complex topologies.

[0212] Technological improvements

[0213] Embodiments of the invention have been described with reference to several inventive features (e.g., operations, systems, engines, and components) associated with an item listing system. The described inventive features include the operations, interfaces, data structures, and arrangements of computing resources associated with providing the functionality described herein with respect to a network management engine associated with a cloud computing system.

[0214] Embodiments of this invention relate to the field of computing, and more specifically, to artificial intelligence systems. The exemplary embodiments described below provide a system, method, and program product, etc., for performing operations that provide network management. Therefore, this embodiment improves the field of cloud mesh networking technology by providing more network management capabilities. For example, the network management engine enhances monitoring and performance management in complex network topologies. Packet latency can be calculated and quantified in real time, allowing for accurate identification of deviations from expected network behavior. The network management engine captures fine-grained service metrics, providing a detailed view of packet-level network performance, rather than relying solely on sampled data. By leveraging advanced telemetry and real-time analytics, the network management engine supports proactive performance monitoring methods, thereby detecting subtle anomalies that might be missed by conventional systems. Furthermore, the network management engine implements correlations between node-level service metrics and observed network behavior, which helps improve troubleshooting accuracy and accelerates response to performance issues on dynamically routed network paths. These capabilities enhance network visibility, allowing for more efficient resource utilization and optimized network operations.

[0215] The functionality of embodiments of the invention has been further described through implementations and anecdotal examples, which demonstrate that using a network management engine in a cloud computing system to provide network management operations solves specific problems in cloud mesh networking technology and thereby improves computing operations in cloud computing systems.

[0216] Additional support for the detailed description of the invention

[0217] Example item listing system environment

[0218] Referring now to Figure 6, Figure 6 shows an example item listing system 600 computing environment in which embodiments of this disclosure may be employed. Specifically, Figure 6 shows a high-level architecture of item listing platform 610, an example item listing platform that can host a technology solution environment or a portion thereof. It should be understood that this and other arrangements described herein are set forth only as examples. For instance, as mentioned above, many of the elements described herein can be implemented as discrete or distributed components or in combination with other components, and in any suitable combination and location. Other arrangements and elements (e.g., machines, interfaces, functions, commands, and groupings of functions) may be used in addition to or in place of the arrangements and elements shown.

[0219] The item listing system 600 may be a cloud computing environment that provides computing resources for the functions associated with the item listing platform 610. For example, the item listing system 600 supports the delivery of computing components and services, including servers, storage, databases, networks, applications, and machine learning associated with the item listing platform 610 and client devices 620. Multiple client devices (e.g., client device 620) include hardware or software for accessing resources on the item listing system 600. Client device 620 may include applications (e.g., client application 622) and interface data (e.g., client application interface data 624) that support client functions associated with the item listing system. Multiple client devices may access the computing components of the item listing system 600 via a network (e.g., network 630) to perform computing operations.

[0220] The item listing platform 610 is responsible for providing a computing environment or architecture, which includes the infrastructure that supports item listing platform functionality (e.g., e-commerce functionality). The item listing platform supports storing items in an item database and provides a search system for receiving queries and identifying search results based on those queries. The item listing platform can also provide a computing environment with features for managing, selling, purchasing, and recommending different types of items. The item listing platform 610 can be specifically used as a content platform, such as the eBay content platform or e-commerce platform developed by eBay Inc. of San Jose, California.

[0221] The item listing platform 610 can provide item listing operations 630 and item listing interfaces 640. Item listing operations 630 may include service operations, communication operations, resource management operations, security operations, and fault-tolerance operations that support specific tasks or functions within the item listing platform 610. Item listing interfaces 640 may include service interfaces, communication interfaces, resource interfaces, security interfaces, and management and monitoring interfaces that support functions between item listing platform components. Item listing operations 630 and item listing interfaces 640 enable communication, coordination, and seamless operation of the item listing system 600.

[0222] As an example, the functionalities associated with the item listing platform 610 may include: shopping operations (e.g., product search and browsing, product selection and shopping cart, checkout and payment, and order tracking); user account operations (e.g., user registration and authentication, and user profiles); seller and product management operations (e.g., seller registration, product listing and inventory management); payment and financial operations (e.g., payment processing, refunds and returns); order fulfillment operations (e.g., order processing and fulfillment, and inventory management); customer support and communication interfaces (e.g., customer support chat / email and notifications); security and privacy interfaces (e.g., authentication and authorization, payment security); recommendation and personalization interfaces (e.g., product recommendations and customer reviews and ratings); analytics and reporting interfaces (e.g., sales and inventory reports, and user behavior analytics); and APIs and integration interfaces (e.g., APIs for third-party integration).

[0223] The item listing platform 610 can provide an item listing platform database (e.g., item listing platform database 650) to effectively manage and store different types of data. The item listing platform database 650 may include relational databases, NoSQL databases, search databases, cache databases, content management systems, analytics databases, payment gateway databases, customer relationship management databases, log and error databases, inventory and supply chain databases, and multi-channel databases, which are used in combination to effectively manage data and provide users with an e-commerce experience.

[0224] The item listing platform 610 supports applications (e.g., application 660), which are computer programs, software components, or services that serve a specific function or set of functions to meet the requirements of a specific item listing platform or its users. Applications can be client-side (user-facing) or server-side (backend). Applications can also include applications without any AI support (e.g., application 662), applications supported by traditional AI models (e.g., application 664), and applications supported by generative AI models (e.g., application 666). As examples, applications may include online storefront applications, mobile shopping applications, application and management consoles, payment gateway integrations, user account and authentication applications, search and recommendation engines, inventory and stock management applications, order processing and fulfillment applications, customer support and communication tools, content management systems, analytics and reporting applications, marketing and promotion applications, multi-channel integration applications, log and bug tracking applications, customer relationship management (CRM) applications, security applications, and APIs and web services, which are used in combination to effectively provide users with an e-commerce experience.

[0225] The item listing platform 610 may include a machine learning engine (e.g., machine learning engine 670). Machine learning engine 670 refers to a machine learning framework or platform that provides the infrastructure and tools for designing, training, evaluating, and deploying machine learning models. Machine learning engine 670 can serve as the backbone for developing and deploying machine learning applications and solutions. Machine learning engine 670 may also provide tools for visualizing data and model results, as well as for interpreting model decisions to understand how the model makes predictions.

[0226] Machine learning engine 670 provides the necessary libraries, algorithms, and utilities to perform various tasks within a machine learning workflow. Machine learning workflows can include data processing, model selection, model training, model evaluation, hyperparameter tuning, scalability, model deployment, inference, integration, customization, and data visualization. Machine learning engine 670 can include pre-trained models for various tasks, thus simplifying the development process. In this way, machine learning engine 670 streamlines the entire machine learning process, from data preparation and model training to deployment and inference, making it accessible and efficient for different types of users (e.g., clients, data scientists, machine learning engineers, and developers) engaged in a wide range of machine learning applications.

[0227] Machine learning engine 670 can be implemented as a component in item listing system 600, which leverages machine learning algorithms and techniques (e.g., machine learning algorithm 672) to enhance various aspects of the item listing system's functionality. Machine learning engine 670 can provide a range of machine learning algorithms and techniques for teaching computers to learn from data and make predictions or decisions without explicit programming. These techniques are widely used in various applications across different industries and include, for example: supervised learning (e.g., linear regression, classification, support vector machines (SVM)); unsupervised learning (e.g., clustering, principal component analysis (PCA), association rules (e.g., apriori)); reinforcement learning (e.g., Q-learning, deep Q-networks (DQN)); deep learning (e.g., neural networks, convolutional neural networks (CNN), and recurrent neural networks (RNN)); and ensemble learning (e.g., random forests).

[0228] Machine learning training data 120 supports the process of building, training, and fine-tuning machine learning models. Machine learning training data 120 consists of labeled datasets used to teach machine learning models to recognize patterns, make predictions, or perform specific tasks. Training data typically includes two main components: input features (X) and labels or target values (Y). Input features can include variables, attributes, or characteristics that serve as input to the machine learning model. Input features (X) can be numerical, categorical, or even textual, depending on the nature of the problem. For example, in a model used to predict house prices, input features might include the number of bedrooms, square footage, neighborhood, etc. Labels or target values (Y) are the values that the model aims to predict or classify. Labels represent the expected output, or ground truth, for each corresponding set of input features. For example, in a spam classifier, labels would indicate whether each email is spam (i.e., binary classification). The training process involves presenting training data to the model, and the model learns to make predictions or decisions by recognizing patterns and relationships between the input features (X) and target values (Y). Machine learning algorithms adjust their internal parameters during training to minimize the discrepancy between their predictions and the actual labels in the training data. Machine learning engine 670 can use historical and real-time data to train models and make predictions, continuously improving performance and user experience.
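The parameter adjustment described above can be made concrete with a minimal sketch. It assumes a one-feature linear model trained by plain gradient descent on mean squared error; the `train_linear` function, the learning rate, and the epoch count are illustrative choices and are not part of the disclosure.

```python
def train_linear(xs, ys, lr=0.05, epochs=2000):
    """Fit y = w * x + b by gradient descent on mean squared error.

    xs are the input features (X) and ys the target values (Y). Each epoch
    nudges w and b in the direction that reduces the average squared
    difference between predictions and labels.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of mean((w*x + b - y)^2) with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


if __name__ == "__main__":
    # Labels generated from y = 2x + 1, so training should recover w≈2, b≈1.
    w, b = train_linear([0, 1, 2, 3], [1, 3, 5, 7])
    print(round(w, 3), round(b, 3))
```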

[0229] Machine learning engine 670 may include machine learning models (e.g., machine learning model 676) generated using machine learning engine workflows. Machine learning model 676 may include generative AI models and traditional AI models, both of which can be used in item listing system 600. Generative AI models are designed to generate new data (typically in the form of text, images, or other media) based on patterns and knowledge learned from existing data. Generative AI models can be used in a variety of ways, including: content generation, product image generation, personalized product recommendations, natural language chatbots, and content summarization. Traditional AI models encompass a wide range of algorithms and techniques and can be used in a variety of ways, including: recommender systems, predictive analytics, search algorithms, fraud detection, customer segmentation, image classification, natural language processing (NLP), and A / B testing and optimization. In many cases, a combination of generative AI models and traditional AI models can be used to deliver a comprehensive and efficient e-commerce experience, combining data-driven insights and creativity.

[0230] Machine learning engine 670 can be used to analyze data, make predictions, and automate processes to provide users with a more personalized and efficient shopping experience. Examples include product recommendations; search and filtering; pricing optimization; inventory and stock management; customer segmentation; customer churn prediction and retention; fraud detection; sentiment analysis; customer support and chatbots; image and video analytics; and ad targeting and marketing. The specific applications of machine learning within the item listing platform 610 can vary depending on specific objectives, available data, and resources.

[0231] The item listing system 600 provides item listing system data that includes customer service interaction information. The item listing system can therefore work with a customer service management system (CMS) to resolve issues arising from item listings. The CMS can be a software solution designed to simplify and automate the processing of customer inquiries and support requests across various communication channels. The CMS centrally manages customer interactions, allowing service teams to efficiently categorize, prioritize, and resolve issues while tracking and managing the entire lifecycle of each case. Through integrated tools such as ticketing systems, knowledge bases, and automation features like AI-driven chatbots, the CMS reduces response times, minimizes manual intervention, and ensures consistently high-quality customer service. The item listing system and the CMS can be integrated to ensure seamless communication and efficient resolution of customer issues.

[0232] Example Distributed Computing System Environment

[0233] Referring now to Figure 7, Figure 7 shows an example distributed computing environment 700 in which embodiments of the present disclosure may be employed. Specifically, Figure 7 shows a high-level architecture of an example cloud computing platform 710 that can host a technology solution environment or a portion thereof (e.g., a data trustee environment). It should be understood that this and other arrangements described herein are set forth only as examples. For example, as stated above, many of the elements described herein can be implemented as discrete or distributed components or in combination with other components, and in any suitable combination and location. Other arrangements and elements (e.g., machines, interfaces, functions, commands, and groupings of functions) may be used in addition to or in place of the arrangements and elements shown.

[0234] The data center can support a distributed computing environment 700, which includes a cloud computing platform 710, racks 720, and nodes 730 (e.g., computing devices, processing units, or blades) within the racks 720. A technology solution environment can be implemented using the cloud computing platform 710, which runs cloud services across different data centers and geographic regions. The cloud computing platform 710 can implement a fabric controller 740 component for providing and managing the resource allocation, deployment, upgrades, and management of cloud services. Typically, the cloud computing platform 710 is used to store data or run service applications in a distributed manner. The cloud computing infrastructure 710 in the data center can be configured to host and support the operation of endpoints for specific service applications. The cloud computing infrastructure 710 can be a public cloud, a private cloud, or a dedicated cloud.

[0235] Node 730 may provide a host 750 (e.g., an operating system or runtime environment) that runs a defined software stack on node 730. Node 730 may also be configured to perform specialized functions (e.g., as a compute node or storage node) within the cloud computing platform 710. Node 730 is allocated to run one or more portions of a tenant's service application. A tenant may refer to a customer utilizing the resources of the cloud computing platform 710. The service application components of the cloud computing platform 710 that support a particular tenant may be referred to as multi-tenant infrastructure or tenancy. In this document, the terms service application, application, and service are used interchangeably and broadly refer to any software, or portion of software, that runs on top of, or accesses storage and computing device locations within, a data center.

[0236] When node 730 is supporting multiple individual service applications, node 730 can be partitioned into virtual machines (e.g., virtual machine 752 and virtual machine 754). Physical machines can also run individual service applications simultaneously. Virtual machines or physical machines can be configured as personalized computing environments supported by resources 760 (e.g., hardware and software resources) in the cloud computing platform 710. It is envisioned that resources can be configured for specific service applications. Furthermore, each service application can be divided into functional parts, allowing each functional part to run on a separate virtual machine. In the cloud computing platform 710, multiple servers can be used to run service applications and perform data storage operations within a cluster. Specifically, servers can perform data operations independently but are exposed as a single device referred to as a cluster. Each server in the cluster can be implemented as a node.

[0237] Client device 780 can connect to service applications in cloud computing platform 710. Client device 780 can be any type of computing device and may correspond to the computing device 700 described with reference to Figure 7; for example, client device 780 can be configured to issue commands to the cloud computing platform 710. In embodiments, client device 780 can communicate with service applications via a virtual Internet Protocol (IP) address and a load balancer, or via other means that direct communication requests to a specified endpoint in the cloud computing platform 710. Components of the cloud computing platform 710 can communicate with each other via a network (not shown), which may include, but is not limited to, one or more local area networks (LANs) and/or wide area networks (WANs).

[0238] Example computing environment

[0239] Having briefly described an overview of embodiments of the present invention, an example operating environment in which embodiments of the present invention can be implemented is described below in order to provide a general context for various aspects of the present invention. Referring first to Figure 8, an example operating environment for implementing embodiments of the present invention is shown and is generally designated as computing device 800. Computing device 800 is merely one example of a suitable computing environment and is not intended to suggest any limitation on the scope of use or functionality of the invention. Nor should computing device 800 be construed as having any dependency or requirement relating to any one or combination of the illustrated components.

[0240] This invention can be described in the general context of computer code or machine-usable instructions, including computer-executable instructions (e.g., program modules) that are executed by a computer or other machine (e.g., a personal data assistant or other handheld device). Generally, a program module, including routines, programs, objects, components, data structures, etc., refers to code that performs a specific task or implements a specific abstract data type. This invention can be practiced in a variety of system configurations, including handheld devices, consumer electronics, general-purpose computers, and more specialized computing devices. This invention can also be practiced in distributed computing environments, where tasks are performed by remote processing devices linked via a communication network.

[0241] Referring to Figure 8, computing device 800 includes a bus 810 that directly or indirectly couples the following devices: memory 812, one or more processors 814, one or more presentation components 816, input/output ports 818, input/output components 820, and an illustrative power supply 822. Bus 810 represents one or more buses (such as an address bus, a data bus, or a combination thereof). The various blocks of Figure 8 are shown with lines for conceptual clarity, and other arrangements of the described components and/or component functionality are also contemplated. For example, a presentation component such as a display device can be considered an I/O component. Additionally, processors have memory. We recognize that such is the nature of the art and reiterate that Figure 8 merely illustrates an example computing device that can be used in connection with one or more embodiments of the present invention. No distinction is made between categories such as "workstation," "server," "laptop," and "handheld device," as all are contemplated within the scope of Figure 8 and the reference to "computing device."

[0242] Computing device 800 typically includes a variety of computer-readable media. Computer-readable media can be any available medium that can be accessed by computing device 800, and includes volatile and non-volatile media, removable and non-removable media. By way of example and not limitation, computer-readable media can include computer storage media and communication media.

[0243] Computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information, such as computer-readable instructions, data structures, program modules, or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by computing device 800. Computer storage media does not include signals per se.

[0244] Communication media typically embody computer-readable instructions, data structures, program modules, or other data in the form of modulated data signals (e.g., carrier waves or other transmission mechanisms), and include any information transmission medium. The term "modulated data signal" refers to a signal whose characteristics are set or altered in a manner that encodes information in the signal. By way of example, and not limitation, communication media include wired media such as wired networks or direct wired connections, and wireless media such as acoustic, RF, infrared, and other wireless media. Any combination of the above should also be included within the scope of computer-readable media.

[0245] Memory 812 includes computer storage media in the form of volatile memory and / or non-volatile memory. The memory can be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid-state memory, hard disk drives, optical disk drives, etc. Computing device 800 includes one or more processors that read data from various entities such as memory 812 or I / O components 820. Presentation component 816 presents data indications to a user or other device. Exemplary presentation components include display devices, speakers, printing components, vibration components, etc.

[0246] I / O port 818 allows computing device 800 to be logically coupled to other devices, some of which may be built-in, including I / O components 820. Illustrative components include microphones, joysticks, game controllers, satellite antennas, scanners, printers, wireless devices, etc.

[0247] Additional structural and functional features of embodiments of this technical solution

[0248] Having identified the various components used herein, it should be understood that any number of components and arrangements can be employed to achieve the desired functionality within the scope of this disclosure. For example, components in the embodiments depicted in the accompanying drawings are shown with lines for conceptual clarity. Other arrangements of these and other components can also be implemented. For instance, although some components are depicted as single components, many elements described herein can be implemented as discrete or distributed components or in combination with other components, and in any suitable combination and location. Some elements may be omitted entirely. Furthermore, as described below, the various functions described herein as being performed by one or more entities can be carried out by hardware, firmware, and/or software. For example, various functions can be carried out by a processor executing instructions stored in memory. Therefore, other arrangements and elements (e.g., machines, interfaces, functions, commands, and groupings of functions) can be used in addition to or in place of the arrangements and elements shown.

[0249] The embodiments described in the following paragraphs can be combined with one or more of the specifically described alternatives. Specifically, the claimed embodiments may include references to more than one other embodiment in the alternatives. The claimed embodiments may specify additional limitations on the claimed subject matter.

[0250] The subject matter of embodiments of the invention is described with specificity herein to meet statutory requirements. However, the description itself is not intended to limit the scope of this patent. Rather, the inventors have contemplated that the claimed subject matter might also be embodied in other ways, in conjunction with other present or future technologies, to include different steps or combinations of steps similar to those described in this document. Furthermore, although the terms "step" and/or "block" may be used herein to denote different elements of the methods employed, such terms should not be construed as implying any particular order among or between the various steps disclosed herein unless and except when the order of individual steps is explicitly described.

[0251] For the purposes of this disclosure, the word "including" has the same broad meaning as the word "comprising," and the word "accessing" includes "receiving," "referencing," or "retrieving." Furthermore, the word "communicating" has the same broad meaning as the words "receiving" or "transmitting" as facilitated by a software- or hardware-based bus, receiver, or transmitter using the communication media described herein. Additionally, unless otherwise stated, words such as "a" and "an" include the plural as well as the singular. Thus, for example, the constraint of "a feature" is satisfied when one or more features are present. Furthermore, the term "or" includes the conjunctive, the disjunctive, and both (a or b thus includes either a or b, as well as a and b).

[0252] For the purposes of the detailed discussion above, embodiments of the present invention are described with reference to a distributed computing environment; however, the distributed computing environment described herein is merely exemplary. Components may be configured to perform novel aspects of the embodiments, wherein the term "configured for" may mean "programmed to" perform a specific task or implement a specific abstract data type using code. Furthermore, while embodiments of the present invention may generally refer to the technical solution environment and the illustrations described herein, it should be understood that the described techniques can be extended to other implementation contexts.

[0253] Embodiments of the invention have been described with respect to specific embodiments intended to be illustrative and not limiting in all respects. Alternative embodiments will become apparent to those skilled in the art without departing from the scope of the invention.

[0254] As can be seen from the foregoing, this invention is well-suited to achieving all the goals and objectives set forth herein, as well as other obvious and inherent advantages of the structure.

[0255] It should be understood that certain features and sub-combinations are useful and can be employed without reference to other features or sub-combinations. This is contemplated by the claims and is within the scope of the claims.

Claims

1. A computer-implemented system, comprising: one or more computer processors; and computer memory storing computer-usable instructions that, when used by the one or more computer processors, cause the one or more computer processors to perform operations, the operations including: transmitting, from a client, a packet associated with an application gateway and a network packet management extension engine, wherein the network packet management extension engine includes a plurality of packet interceptors that support tracking packet latency metrics using packets sent via the application gateway; based on transmitting the packet, receiving a response packet associated with the packet, the packet being associated with a packet latency metric tracked using the plurality of packet interceptors; extracting packet latency data associated with the response packet and the packet latency metric; and sending the packet latency data.

2. The system according to claim 1, wherein the operations further include: using the packet latency data, generating a per-hop latency associated with the packet.

3. The system according to claim 1, wherein the operations further include transmitting the packet latency data to update a packet latency log associated with a path analysis graph, wherein the path analysis graph is a visual representation of packet latency data associated with response packets and their corresponding routes.

4. The system according to claim 1, wherein the operations further include dynamically selecting an alternative path for subsequent packets based on the packet latency data.

5. The system according to claim 1, wherein a network filter updates the response packet based on the packet latency data.

6. The system according to claim 1, wherein a network filter supports lookup and delete operations performed on a map data structure storing the packet latency data, to support updating the response packet.

7. The system according to claim 1, wherein the client is a synthetic traffic generator that simulates network traffic patterns and conditions for testing and evaluating performance of the network.

8. One or more computer storage media having computer-executable instructions embodied thereon that, when executed by a computing system having a processor and memory, cause the processor to perform operations, the operations including: accessing network performance data associated with a network, wherein the network performance data includes packet latency data generated using a plurality of packet interceptors; determining that bandwidth utilization meets a predefined threshold of the network, wherein the determining is based on the packet latency data tracked using the plurality of packet interceptors in the network; and sending an alert associated with the bandwidth utilization.

9. The media according to claim 8, wherein the network performance data includes a path analysis graph based on response packets from one or more clients, wherein the path analysis graph is a visual representation of packet latency data associated with the response packets and their corresponding routes.

10. The media according to claim 8, wherein the network performance data includes traffic metrics associated with a transport layer balancer (TLB) and an ingress gateway.

11. The media according to claim 8, wherein the network includes an application gateway and a transport layer balancer (TLB) that is operatively coupled to an ingress gateway via a tunnel.

12. The media according to claim 8, wherein the plurality of packet interceptors includes a first packet interceptor configured to add an ingress time to a timestamp option in an incoming packet, wherein the first packet interceptor is operatively coupled to a traffic control hook of a transport layer balancer (TLB).

13. The media according to claim 8, wherein the plurality of packet interceptors includes a second packet interceptor configured to calculate a transport layer balancer (TLB) duration indicating packet latency, wherein the second packet interceptor is operatively coupled to a traffic control hook on an egress network interface of the TLB.

14. The media according to claim 8, wherein the plurality of packet interceptors includes a third packet interceptor configured to inspect tunneled packets to obtain packet latency data, wherein the third packet interceptor is operatively coupled to a traffic control hook on an ingress network interface of an ingress gateway.

15. The media according to claim 8, wherein the alert is associated with node-level traffic related to the network.

16. The media according to claim 8, wherein determining that the bandwidth utilization meets the predefined threshold is based on correlating an increase in packet latency, obtained from the packet latency data, with an increase in traffic load in the network.

17. A computer-implemented method, the method comprising: accessing, at a network filter, a response packet associated with packet latency data determined from a plurality of packet interceptors that support tracking packet latency metrics using sent packets; updating, using the packet latency data, the response packet associated with a client; and sending the response packet to the client.

18. The computer-implemented method according to claim 17, wherein the network filter supports lookup and delete operations performed on a map data structure storing the packet latency data, to support generating the response packet.

19. The computer-implemented method according to claim 17, wherein the packet latency data is stored in a map data structure associated with keys and values, wherein the keys are based on an inner packet source IP and port, and the values are based on a tunnel source IP and a TLB duration.

20. The computer-implemented method according to claim 17, wherein the network filter is operatively coupled to a server application and to an ingress gateway of an application gateway, the application gateway including a transport layer balancer (TLB) operatively coupled to the ingress gateway via a tunnel.
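
By way of illustration only, the key/value layout recited in claim 19 and the lookup and delete operations recited in claims 6 and 18 might be sketched as follows. This is a minimal user-space sketch in Python, not the claimed eBPF map or Envoy filter. The names `LatencyMap`, `LatencyEntry`, and `annotate_response`, and the header names `x-tlb-duration-ns` and `x-tunnel-source-ip`, are hypothetical and do not appear in the specification. The nanosecond unit for the TLB duration is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Key: (inner packet source IP, inner packet source port), per claim 19.
FlowKey = Tuple[str, int]


@dataclass(frozen=True)
class LatencyEntry:
    """Value: tunnel source IP and TLB duration, per claim 19."""
    tunnel_source_ip: str
    tlb_duration_ns: int  # assumed unit: nanoseconds


class LatencyMap:
    """Hypothetical stand-in for the map shared by interceptors and filter."""

    def __init__(self) -> None:
        self._entries: Dict[FlowKey, LatencyEntry] = {}

    def record(self, key: FlowKey, entry: LatencyEntry) -> None:
        # An interceptor would write here after measuring the TLB duration.
        self._entries[key] = entry

    def lookup_and_delete(self, key: FlowKey) -> Optional[LatencyEntry]:
        # Look up the entry and remove it so that the map does not grow
        # without bound; returns None when no entry exists for the key.
        return self._entries.pop(key, None)


def annotate_response(headers: Dict[str, str], key: FlowKey,
                      latency_map: LatencyMap) -> Dict[str, str]:
    """Return a copy of the response headers updated with latency data."""
    updated = dict(headers)
    entry = latency_map.lookup_and_delete(key)
    if entry is not None:
        # Header names are illustrative assumptions only.
        updated["x-tlb-duration-ns"] = str(entry.tlb_duration_ns)
        updated["x-tunnel-source-ip"] = entry.tunnel_source_ip
    return updated
```

In this sketch the entry is removed on first lookup, so each response is annotated at most once per recorded measurement, and a response with no recorded entry is returned unchanged.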