A health check system and method
The distributed health check system based on the agent-server architecture solves the problem of high maintenance costs in existing technologies, enables support for larger-scale businesses, and adapts to the expansion of new businesses and scenarios.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHINA TELECOM CLOUD TECH CO LTD
- Filing Date
- 2024-12-06
- Publication Date
- 2026-06-12
Figure CN119814609B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computing network technology, and in particular to a health check system and method.
Background Technology
[0002] In computing network environments, it is often necessary to implement functions such as load balancing, high availability, and QoS to ensure efficient concurrent access from clients. These functions require underlying support from health checks or probes.
[0003] However, when performing health checks, most of the related technologies handle probing within a single cluster or a single scenario, which cannot support larger-scale business operations. Furthermore, the probing methods are scattered across various systems, and the methods for diagnosis, troubleshooting, and maintenance are diverse, which raises maintenance costs.
Summary of the Invention
[0004] This invention provides a health check system and method to offer scalability for the health check system, thereby supporting larger-scale business operations while reducing maintenance costs.
[0005] A first aspect of this invention provides a health check system, the system comprising at least: a console, a server, and an agent;
[0006] The console is used to send a creation request to the server, and the creation request is used to create a health check task;
[0007] The server is used to determine the target probe deployed in the agent based on the creation request;
[0008] The agent is used to send a pull request to the server to obtain the probe configuration corresponding to the health check task in the server;
[0009] The agent is used to probe the probed object through the target probe according to the probe configuration, obtain the detection result, and report the detection result to the server;
[0010] The server is used to query the original detection result of the probed object, compare the original detection result with the reported detection result, and determine that the state of the probed object has changed if the two differ.
[0011] A second aspect of this invention provides a health check method, applied to the server side of the health check system described in the first aspect of this invention, the method comprising:
[0012] Receive a creation request sent from the console, the creation request being used to create a health check task;
[0013] Based on the creation request, the target probe machine deployed in the agent is determined, and probe machine confirmation information is sent to the agent corresponding to the target probe machine;
[0014] Receive a pull request sent by the agent corresponding to the target probe, retrieve the probe configuration corresponding to the health check task based on the pull request, and return it to the agent so that the agent can probe the target object through the target probe according to the probe configuration, obtain the detection result, and report the detection result to the server;
[0015] The original detection results of the object are queried, and the original detection results are compared with the detection results. If there is a difference between the original detection results and the detection results, it is determined that the state of the object has changed.
[0016] A third aspect of this invention provides a health check method, applied to an agent in the health check system described in the first aspect of this invention, the method comprising:
[0017] Receive probe confirmation information sent by the server. The probe confirmation information is generated by the server after receiving a creation request from the console and determining the target probe deployed in the agent based on the creation request. The creation request is used to create a health check task;
[0018] Send a pull request to the server to obtain the detection configuration corresponding to the health check task returned by the server;
[0019] According to the detection configuration, the target detector detects the target object, obtains the detection result, and reports the detection result to the server so that the server can query the original detection result of the target object, compare the original detection result with the detection result, and determine that the state of the target object has changed if there is a difference between the original detection result and the detection result.
[0020] The health check system provided in this embodiment of the invention includes at least a console, a server, and an agent. The console sends a creation request to the server to create a health check task. The server determines the target probe deployed in the agent based on the creation request. The agent sends a pull request to the server to obtain the probe configuration corresponding to the health check task from the server. According to the probe configuration, the agent probes the target object using the target probe to obtain the probe results and reports the results to the server. The server queries the original probe results of the target object, compares the original results with the current detection results, and determines that the state of the target object has changed if there is a difference between them, thus realizing a health check for the target object. Therefore, this embodiment implements a distributed health check system through an agent-server architecture, providing scalability for the health check system. For new businesses and scenarios, no modification to the health check system is required, or only a few modules need to be added, achieving support for a larger scale of business while reducing maintenance costs.
Attached Figure Description
[0021] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments of the present invention will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0022] Figure 1 is a structural block diagram of a health check system according to an embodiment of the present invention;
[0023] Figure 2 is a framework diagram of a distributed health check system proposed in an embodiment of the present invention;
[0024] Figure 3 is a diagram illustrating an agent deployment architecture according to an embodiment of the present invention;
[0025] Figure 4 is a schematic diagram illustrating a mapping relationship according to an embodiment of the present invention;
[0026] Figure 5 is a module partitioning diagram of an agent and a server according to an embodiment of the present invention;
[0027] Figure 6 is a flowchart of a health check method provided in an embodiment of the present invention;
[0028] Figure 7 is a flowchart of a health check method provided in an embodiment of the present invention;
[0029] Figure 8 is an overall flowchart of a health check method proposed in an embodiment of the present invention.
Detailed Implementation
[0030] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0031] Common health check functions in related technologies include:
[0032] 1. Probe Scheduling: The number of probes and target nodes grows as the business grows. If every probe probes every target node, a large number of probe requests is generated and the performance of the target nodes is affected. Scheduling can balance the load across probes, for example by using graph algorithms to select probes.
[0033] 2. Probing methods: Common probing methods include UDP, TCP, ICMP and HTTP, as well as simulations combined with business requests.
[0034] 3. Supported detection scenarios: Result replacement in DDNS, service switching in service discovery, and fault removal and recovery in ECMP (Equal-cost multi-path routing) and load balancing.
[0035] 4. Probe performance: From the perspective of the data plane, partial link probing is used, and then the results are synchronized to other service devices via shared memory.
[0036] 5. Probe cluster management: Business-related devices, such as load balancers and ECMP, can be used to group the clusters; alternatively, independent probe devices can be used to group the clusters.
[0037] 6. Monitoring of the detection system: In order to ensure the detection quality, the detection system itself needs to be monitored in different dimensions to ensure the stability and reliability of the detection results. The methods include component monitoring, detection machine operating system monitoring (such as CPU, memory, IO, etc.), and log monitoring.
[0038] 7. Detection Result Data Analysis: Packet loss, latency, and inspection time of the detection results are combined with other system indicators (such as CPU, memory, IO, etc.) to perform weighted quality analysis.
[0039] However, most of the related technologies are designed for detection within a single cluster or scenario, which cannot support larger-scale business operations. Furthermore, the detection methods are scattered across various systems, and the methods for diagnosis, troubleshooting, and maintenance are diverse, requiring higher maintenance costs.
[0040] Therefore, to at least partially address one or more of the aforementioned problems and other potential issues, this invention proposes a health check system, primarily providing scalability, including horizontal and business-level expansion. Horizontal expansion mainly includes stateless component design and consistent hashing for balanced requests. Business-level expansion refers to the system's ability to adapt to business systems compliant with the IP protocol, providing the capability for callback state switching. This invention implements a distributed health check system through an agent-server architecture. For new businesses and scenarios, no modification to the health check system is required, or only a few modules need to be expanded, achieving support for larger-scale business operations while reducing maintenance costs.
[0041] Refer to Figure 1, which is a structural block diagram of a health check system according to an embodiment of the present invention. As shown in Figure 1, in this embodiment, since the health check data model is relatively small, the health check configuration does not change much, and the main detection processes are similar, they can be uniformly processed through model abstraction. Therefore, the health check system is designed as an agent-server architecture, i.e., a distributed system architecture. Specifically, the health check system in this embodiment includes at least: a console, a server, and an agent.
[0042] The console is used to send a creation request to the server, and the creation request is used to create a health check task.
[0043] In this embodiment, the console is used to operate resources. For health checks, users can initiate a creation request through the console. The console can send the creation request to the server via HTTP. The creation request is used to create a health check task, which is used to perform a health check on the business service.
[0044] The server is used to determine the target probe machine deployed in the agent based on the creation request.
[0045] In this embodiment, after receiving the creation request, the server can select a target probe from the probes. This target probe is the probe that will perform the health check task. In this embodiment, the probe is deployed within the agent.
[0046] In a preferred embodiment, the probe is deployed as a plug-in within the agent. With its unified model abstraction and plug-in approach to probing within the agent, this embodiment is applicable to most scenarios, such as load balancing, ECMP, and website liveness status, and supports centralized management.
[0047] In one optional embodiment, probes deployed in the agent can be added via the console or through automatic registration initiated by the agent. When adding a probe, its region and label information are set according to actual conditions to facilitate server-side scheduling. The load information of the created node (i.e., the created probe) is 0 by default. In this embodiment, the probe addition steps are repeated as needed until a sufficient number of probes are available.
[0048] In one embodiment, after the server determines the target probe deployed in the agent, it can generate probe confirmation information based on the target probe. The server can send the probe confirmation information to the agent corresponding to the target probe via gRPC, or the agent can periodically pull the corresponding probe confirmation information from the server via gRPC, so that the agent is informed of business changes. The business changes can include changes in the probe configuration, including at least any of the following: adding or deleting probe targets, and increasing or decreasing the probe frequency.
[0049] The agent is used to send a pull request to the server to obtain the detection configuration corresponding to the health check task in the server.
[0050] In this embodiment, the agent corresponding to the target probe can send a pull request to the server. The server retrieves the probe configuration corresponding to the probe object in the health check task based on the pull request and returns the probe configuration to the agent, enabling the agent to obtain the probe configuration corresponding to the health check task from the server. The server can store probe configurations for multiple probe objects.
[0051] The agent is used to detect the target object through the target detection machine according to the detection configuration, obtain the detection result, and report the detection result to the server.
[0052] In this embodiment, the agent can, based on the acquired detection configuration, probe the target object through the deployed target probe machine, obtain the detection results, and report the results to the server. In this embodiment, the probe object can refer to the business service to be checked in the health check task. The detection result includes at least: the detection status.
[0053] In one optional embodiment, after obtaining the detection configuration, the agent can use a coroutine or process of the local process to detect the target object, or use an additional program to detect the target object, and then collect the detection results.
[0054] The server is used to query the original detection results of the detected object, compare the original detection results with the detected results, and determine that the state of the detected object has changed if there is a difference between the original detection results and the detected results.
[0055] In this embodiment, after receiving the detection results for the target object reported by the agent, the server can query the original detection results for the target object, which may refer to the previous detection results for the target object. The original detection results are then compared with the current detection results. If they differ, the server can determine that the status of the target object has changed and obtain the health check result of the target object. For example, if the original detection results showed the detection status as "active" while the current detection results show an abnormal status (such as packet loss or keyword not found), the two results differ, which indicates a change in the status of the target object.
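The comparison described above can be sketched as follows. This is a minimal illustration only: the in-memory dictionary `last_status` and the function name `report_result` are assumptions made for the example, and treating a first report as "no change" is also an assumption, since the description does not say how a missing original result is handled.

```python
from typing import Dict, Optional

# Hypothetical store of the last known detection status per probed object.
last_status: Dict[str, str] = {}


def report_result(object_id: str, status: str) -> bool:
    """Record the reported status and return True if it differs from the original one."""
    previous: Optional[str] = last_status.get(object_id)
    last_status[object_id] = status
    # Assumption: a first report has no original result, so it is not a change.
    return previous is not None and previous != status
```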
[0056] In this embodiment, a distributed health check system is implemented through an agent-server architecture. Health checks on probed objects can be achieved through the interaction between the console, server, and agent. For new services and scenarios, no modification to the health check system is required, or only a few modules need to be added, thus supporting a larger scale of business while reducing maintenance costs.
[0057] Furthermore, in conjunction with the above embodiments, in one implementation, there are multiple target detectors, and the aforementioned "detecting the target object through the target detector and obtaining the detection result" may specifically include:
[0058] Each target probe probes the target object and obtains its own detection result.
[0059] The ratio of the number of target probes with normal detection results to the total number of effective probes is determined, and the ratio of the number of target probes with abnormal detection results to the total number of effective probes is determined.
[0060] If the ratio of target probes with normal detection results to the total number of effective probes is greater than a first threshold, the agent can determine that the detection status of the target is normal and obtain the detection result.
[0061] If the ratio of target probes with abnormal detection results to the total number of effective probes is greater than a second threshold, the agent can determine that the detection status of the target is abnormal and obtain the detection result.
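The ratio test in the steps above can be illustrated with a short sketch. The function name, the default threshold values of 0.5, and the use of `None` to mark a probe that is not effective are all assumptions for illustration; the description does not fix these details.

```python
from typing import List, Optional


def aggregate_status(results: List[Optional[bool]],
                     first_threshold: float = 0.5,
                     second_threshold: float = 0.5) -> str:
    """Combine per-probe results (True = normal, False = abnormal) into one status."""
    # Assumption: None marks a probe that returned nothing usable and is not effective.
    effective = [r for r in results if r is not None]
    if not effective:
        return "undetermined"
    normal_ratio = sum(1 for r in effective if r) / len(effective)
    abnormal_ratio = sum(1 for r in effective if not r) / len(effective)
    if normal_ratio > first_threshold:
        return "normal"
    if abnormal_ratio > second_threshold:
        return "abnormal"
    # Neither ratio exceeds its threshold, so this sketch reports no decision.
    return "undetermined"
```

With the default thresholds, an even split between normal and abnormal probes yields no decision, which is one reason the two thresholds may be configured separately.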
[0062] In conjunction with the above embodiments, in one implementation, this embodiment of the invention also provides a health check system. Specifically, in this embodiment, the server is further configured to, when the state of the probed object changes, perform a callback according to the callback processing method in the probe configuration, and notify the corresponding business system.
[0063] In this embodiment, the server can also trigger a callback according to the callback processing method in the probe configuration when it determines that the state of the probe object has changed. This callback will notify the corresponding business system of the health check results, for example, through email alerts or traffic switching. This embodiment supports custom callbacks and can be integrated with various business or notification systems. Since the health check system remains unchanged, no stability issues will be introduced. In other words, this embodiment allows for state switching by adding a new business system through custom callbacks, without requiring changes to the health check system.
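One way to picture custom callbacks is a registry keyed by the callback method named in the probe configuration. The sketch below is hypothetical: the registry, the function names, and the handler signature are assumptions, and a real handler would send an HTTP request or publish a message rather than append to a list.

```python
from typing import Callable, Dict

# Handler signature assumed for this sketch: (object_id, new_status) -> None.
Callback = Callable[[str, str], None]
_callbacks: Dict[str, Callback] = {}


def register_callback(method: str, handler: Callback) -> None:
    """Attach a business or notification system under a callback method name."""
    _callbacks[method] = handler


def notify_state_change(object_id: str, new_status: str, method: str) -> None:
    """Invoke the callback configured for a probed object whose state changed."""
    handler = _callbacks.get(method)
    if handler is None:
        raise KeyError(f"no callback registered for method {method!r}")
    handler(object_id, new_status)
```

Adding a new business system then amounts to one more `register_callback` call, while the dispatch code itself stays unchanged.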
[0064] In conjunction with the above embodiments, in one implementation, the present invention also provides a health check system. Specifically, in this embodiment, the above-mentioned "sending a pull request to the server to obtain the probe configuration corresponding to the health check task in the server" may specifically include the following steps:
[0065] Step S11: Perform consistent hashing on the hostname to determine the first server among multiple servers.
[0066] In this embodiment, there are multiple servers; that is, the health check system includes multiple servers. The agent can perform consistent hashing on its hostname to determine the first server from the multiple servers. The server list can be obtained through an additional service discovery system or the DNS system. The first server is the one that provides the probe configuration for the health check task.
[0067] Step S12: Send the pull request to the first server based on the address of the first server to obtain the probe configuration in the first server.
[0068] In this embodiment, after identifying the first server, the agent can send a pull request to the first server based on the server's address to obtain the probe configuration from the first server. Specifically, the agent can send a pull request to the first server, and the first server can obtain the probe configuration corresponding to the health check task based on the pull request and return it to the agent.
[0069] In this embodiment, consistent hashing of the hostname is performed to obtain the address of the specific first server before requesting the probe configuration. This ensures that the requests are evenly distributed, achieves load balancing, and reduces the amount of data migration after a server failure.
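A minimal consistent-hashing sketch is given below to make the server selection concrete. The class name `HashRing`, the use of SHA-256, and the 100 virtual nodes per server are assumptions chosen for the example; the description does not specify a hash function or ring layout.

```python
import bisect
import hashlib
from typing import List


class HashRing:
    """Consistent-hash ring mapping a key (e.g. an agent hostname) to a server."""

    def __init__(self, servers: List[str], vnodes: int = 100) -> None:
        # Each server is placed on the ring at several points (virtual nodes).
        self._ring = sorted(
            (self._hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def lookup(self, key: str) -> str:
        if not self._ring:
            raise ValueError("the ring has no servers")
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, self._hash(key)) % len(self._points)
        return self._ring[idx][1]
```

When a server is removed from the ring, only the keys that were mapped to it move to another server; all other keys keep their assignment, which is the reduced data migration mentioned above.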
[0070] In conjunction with the above embodiments, in another implementation, this embodiment of the invention also provides a health check system. Specifically, in this embodiment, the above-mentioned "reporting the detection results to the server" may include the following steps:
[0071] Step S21: Perform a consistent hash on the health check ID of the health check task to determine the second server among multiple servers.
[0072] In this embodiment, there are multiple servers, meaning the health check system includes multiple servers. The agent can probe the target at a probe frequency (e.g., once every 1-60 minutes), obtain multiple probe results, and then determine the final probe result from these multiple results. Furthermore, the agent can perform a consistent hash calculation on the health check ID of the health check task to determine a second server from the multiple servers. This second server is the one to which the probe results for the health check task are reported. It should be noted that the first server and the second server in this invention can be the same server or different servers; this is not limited.
[0073] Step S22: Report the detection results to the second server.
[0074] In this embodiment, after determining the second server, the final determined detection results can be reported to the second server.
[0075] In a specific example, the agent can obtain a list of servers, i.e., identify the multiple servers of the health check system, through an additional service discovery system or a DNS system. Consistent hashing can be performed on the health check ID to calculate the mapping relationship `hc_id->server_ip`. The mapping can then be aggregated by server into `server_ip->[]{hc_id}`, which identifies the second server for each group of health check IDs, and the probe results are reported to that second server in batches. Here, `hc_id` is the health check ID, and `server_ip` is the server address.
[0076] Analyzing the detection results on the server side is often CPU-intensive, so horizontal scaling needs to be considered. Therefore, in this embodiment, the health check ID is consistently hashed and then reported to the second server in batches to achieve horizontal scaling and support a larger scale of business.
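The aggregation from `hc_id->server_ip` to `server_ip->[]{hc_id}` can be sketched as a simple grouping step. In this illustration the consistent-hash lookup is passed in as a function, `pick_server`, which is a hypothetical parameter; the result tuple layout is likewise an assumption.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Result = Tuple[str, str]  # (hc_id, detection status); layout assumed for this sketch


def group_results_by_server(results: List[Result],
                            pick_server: Callable[[str], str]) -> Dict[str, List[Result]]:
    """Group probe results into one batch per target server address."""
    batches: Dict[str, List[Result]] = defaultdict(list)
    for hc_id, status in results:
        # pick_server stands in for the consistent hash on the health check ID.
        batches[pick_server(hc_id)].append((hc_id, status))
    return dict(batches)
```

Each value in the returned dictionary can then be sent to its server in a single batched report instead of one request per health check.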
[0077] In conjunction with the above embodiments, in another implementation, this embodiment of the invention also provides a health check system. In this embodiment, the server is further used to manage the information of the detector, the detector information including at least: load information, label information, and region information.
[0078] In this embodiment, the above-mentioned "determining the target probe deployed in the agent" may specifically include: selecting the target probe for performing the health check task based on the minimum load information, tag information, and region information, taking into account the actual packet volume and CPU consumption.
[0079] In this embodiment, the server can select the target probe to perform the health check task from multiple probes based on the probe information, including minimum load information, tag information, and region information, taking into account the actual packet volume and CPU consumption. The selected target probe is then written to a persistent database. This embodiment, by calculating changes in probe load values, achieves a more balanced distribution of traffic and load compared to scheduling based on the probe object.
[0080] For example, the scheduling strategy for screening detectors may include: distributing them across regions as much as possible based on regional information, with one detector selected for each region; filtering based on the service type and tenant information of the health check task, ensuring that at least two detectors are available for each target region. Furthermore, the scheduling strategy in this embodiment can be adjusted and configured as needed.
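The sketch below illustrates one possible reading of this strategy: filter probes by tag, then pick the least-loaded probe in each region. The `Probe` class and `select_probes` function are hypothetical. The load is computed as the sum of method weight times probe frequency, using the example weights given for the Cluster Manager module elsewhere in this description; selecting exactly one probe per region is a simplifying assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

# Example method weights quoted in the Cluster Manager description.
METHOD_WEIGHT = {"icmp": 1, "udp": 1, "tcp": 2, "http": 3, "ssl": 5, "http/2": 6}


@dataclass
class Probe:
    name: str
    region: str
    tags: Set[str]
    # Assigned tasks as (probe method, requests per second); empty for a new probe.
    tasks: List[Tuple[str, float]] = field(default_factory=list)

    @property
    def load(self) -> float:
        return sum(METHOD_WEIGHT[method] * qps for method, qps in self.tasks)


def select_probes(probes: List[Probe], required_tag: str) -> List[Probe]:
    """Pick the least-loaded probe carrying the tag in each region."""
    best: Dict[str, Probe] = {}
    for probe in probes:
        if required_tag not in probe.tags:
            continue
        current = best.get(probe.region)
        if current is None or probe.load < current.load:
            best[probe.region] = probe
    return [best[region] for region in sorted(best)]
```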
[0081] The server is also used to identify the agent that deploys the target detector as the agent that obtains the detection configuration, and send detector confirmation information to the agent corresponding to the target detector.
[0082] In this embodiment, after the server determines the target probe, it can generate probe confirmation information, identify the agent that deploys the target probe as the agent that obtains the probe configuration, and send the probe confirmation information to the agent corresponding to the target probe.
[0083] The agent is specifically used to send a pull request to the server based on the confirmation information from the probe.
[0084] In this embodiment, after receiving the confirmation information from the probe, the agent can send a pull request to the server based on the confirmation information.
[0085] In conjunction with the above embodiments, in one implementation, the present invention also provides a health check system. Specifically, in this embodiment, the server is further configured to:
[0086] Interact with multiple caches or time-series databases to shard and store the detection results and / or the detection configuration in a distributed cache.
[0087] In this embodiment, the server interfaces with multiple caches or time-series databases. The server can store probe results and / or probe configurations in distributed caches. This method of storing probe results and / or probe configurations in distributed caches ensures that the components are stateless and enables horizontal scaling. In other words, this embodiment achieves horizontal scaling through data sharding, thereby supporting larger-scale business operations.
[0088] Figure 2 is a framework diagram of a distributed health check system proposed in one embodiment of the present invention. As shown in Figure 2, the distributed health check system includes at least: multiple agents (e.g., agent1 and agent2), multiple servers (e.g., server1, server2, and server3), and a database. The database includes MariaDB and a distributed cache cluster or time-series database. For example, the servers can shard the probe results and / or probe configurations and store them in the distributed cache. In this embodiment, to improve the communication performance between the agents and servers, they can communicate via gRPC, over which the health check ID (i.e., hc id, where hc stands for health check) is sent to the server.
[0089] In conjunction with the above embodiments, in one implementation, the present invention also provides a health check system. Specifically, in this embodiment, there are multiple agents and multiple servers, with the multiple servers deployed in different regions, and the multiple agents in each region deployed in physical machines, virtual machines, and / or containers within that region.
[0090] The health check system in this embodiment further includes an external server, which includes at least an intelligent operation and maintenance system for receiving the mirrored detection results sent by the server, and performing performance and quality analysis on the detected object based on the mirrored detection results to obtain analysis results.
[0091] In this embodiment, after receiving the detection results, the server can send a mirror copy of the detection results to the intelligent operation and maintenance system. After receiving the mirrored detection results sent by the server, the intelligent operation and maintenance system can perform performance and quality analysis on the detected object based on the mirrored detection results and obtain the analysis results.
[0092] Figure 3 is a diagram illustrating an agent deployment architecture according to an embodiment of the present invention. As shown in Figure 3, Region A contains one server and three agents, deployed on physical machine A1 (host A1), virtual machine A1 (vm A1), and container A1, respectively. Region B also contains one server and three agents, deployed on physical machine B1 (host B1), virtual machine B1 (vm B1), and container B1, respectively. The external server at the top of Figure 3 is the external server of the health check system, which can be an intelligent operation and maintenance system.
[0093] In conjunction with the above embodiments, in one implementation, the present invention also provides a health check system. Specifically, in this embodiment, the server is further configured to:
[0094] Receive the pull request and determine the agentID based on the hostname in the pull request;
[0095] Look up the health check association table by the agentID to obtain the health check ID, where the health check association table includes the mapping relationship between agentIDs and health check IDs;
[0096] Look up the health check table by the health check ID to obtain the detection configuration corresponding to the health check ID, and return the detection configuration to the agent, where the health check table includes the mapping relationship between health check IDs and detection configurations.
[0097] The detection configuration includes at least one of the following: detection frequency, health threshold, anomaly threshold, and detection type.
[0098] In this embodiment, the server can determine the corresponding probe configuration based on the hostname in the received pull request. Specifically, it determines the agentID (node) based on the hostname, then associates the agentID with the health check association table to obtain the health check ID, then associates the health check table with the health check ID to obtain the probe configuration corresponding to the health check ID, and returns the probe configuration to the agent.
[0099] In this embodiment, the server pre-stores a health check association table (TableNodeHcAssociation) and a health check table (Table healthcheck). The health check association table includes a mapping relationship between agentIDs and health check IDs, and the health check table includes a mapping relationship between health check IDs and probe configurations. Figure 4 is a schematic diagram illustrating a mapping relationship according to an embodiment of the present invention. As shown in Figure 4, the node table (Table Node) includes a node id (i.e., agentID). Based on the node id and the health check association table (TableNodeHcAssociation), the hc id (i.e., health check ID) is determined. Finally, based on the hc id and the health check table (Table healthcheck), the corresponding probe configuration is determined.
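The three-table lookup can be pictured with plain dictionaries standing in for the database tables. All names and sample values below are hypothetical; a real server would issue the equivalent joined query against its database.

```python
from typing import Dict, List

# Stand-ins for the three tables; contents are illustrative only.
node_table: Dict[str, int] = {"host-a1": 1}                 # hostname -> agentID (node id)
node_hc_association: Dict[int, List[int]] = {1: [101, 102]}  # agentID -> health check IDs
healthcheck_table: Dict[int, dict] = {                       # health check ID -> configuration
    101: {"type": "tcp", "interval": 5},
    102: {"type": "http", "interval": 10},
}


def configs_for_host(hostname: str) -> List[dict]:
    """Resolve hostname -> agentID -> health check IDs -> probe configurations."""
    agent_id = node_table.get(hostname)
    if agent_id is None:
        return []  # Unknown agent: nothing to return in this sketch.
    hc_ids = node_hc_association.get(agent_id, [])
    return [healthcheck_table[hc_id] for hc_id in hc_ids if hc_id in healthcheck_table]
```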
[0100] In this embodiment, all possible health check scenarios are covered by the same abstraction, and the detection configuration includes at least one of the following: detection frequency, health threshold, abnormality threshold, and detection type. The detection frequency is defined as interval, the number of seconds between detections; the health threshold is the number of consecutive normal checks after which the object is considered normal; the abnormality threshold is the number of consecutive abnormal checks after which the object is considered abnormal; and the detection type is defined as type, covering detection methods such as UDP, TCP, ICMP, and HTTP.
[0101] In another embodiment, the probe configuration may also include at least one of the following: source IP (src ip), destination IP (dest ip), source port (src port), destination port (dest port), custom probe parameters (options, such as HTTP URI probe and custom return status codes), callback handling method (callback, which can support HTTP, Kafka, etc.), encapsulation type (encapsulate type, such as flat without encapsulation, VXLAN encapsulation, Geneve encapsulation, etc.), encapsulation ID (encapsulate ID, such as VNI, etc.), and encapsulation options (encapsulate options).
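The unified probe configuration described above could be modeled roughly as follows. This is only a sketch: the class name `ProbeConfig`, the snake_case field names, the defaults, and the validation rules are assumptions made for illustration, not part of the described system.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProbeConfig:
    # Core fields named in the description.
    type: str                    # probe type, e.g. "icmp", "udp", "tcp", "http"
    interval: int                # seconds between probes
    healthy_threshold: int       # consecutive normal checks before "normal"
    unhealthy_threshold: int     # consecutive abnormal checks before "abnormal"
    # Optional fields from the extended embodiment.
    src_ip: Optional[str] = None
    dest_ip: Optional[str] = None
    src_port: Optional[int] = None
    dest_port: Optional[int] = None
    options: dict = field(default_factory=dict)   # e.g. HTTP URI, accepted status codes
    callback: Optional[str] = None                # e.g. "http" or "kafka"
    encapsulate_type: str = "flat"                # "flat", "vxlan", "geneve" (default is assumed)
    encapsulate_id: Optional[int] = None          # e.g. a VNI
    encapsulate_options: dict = field(default_factory=dict)

    def validate(self) -> None:
        """Assumed sanity checks; the source does not prescribe any."""
        if self.interval <= 0:
            raise ValueError("interval must be positive")
        if self.healthy_threshold < 1 or self.unhealthy_threshold < 1:
            raise ValueError("thresholds must be at least 1")
```

For example, `ProbeConfig(type="http", interval=5, healthy_threshold=3, unhealthy_threshold=2, dest_ip="192.0.2.10", dest_port=80, options={"uri": "/healthz"})` would describe an HTTP probe sent every five seconds without encapsulation.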
[0102] Figure 5 is a module partitioning diagram of an agent and a server according to an embodiment of the present invention. As shown in Figure 5, the server comprises the following five modules:
[0103] 1. Cluster Manager: Used to manage the information and scheduling of the probe machines. `node.load` represents the node load, calculated as Σ(m*n) over the probes assigned to the node, where `m` is the weight of the probe method (icmp: 1, udp: 1, tcp: 2, http: 3, ssl: 5, http/2: 6; the specific values can be estimated based on CPU consumption and actual protocol packet volume) and `n` is the probe frequency (qps, requests per second).
[0104] 2. Health Check Manager: Used to manage health check configurations. It unifies and abstracts all possible health check scenarios, with common fields including source IP (src ip), destination IP (dest ip), source port (src port), destination port (dest port), probe frequency (interval, the number of seconds between probes), health threshold (healthy threshold, considered normal after several consecutive normal checks), unhealthy threshold (unhealthy threshold, considered abnormal after several consecutive abnormal checks), probe type (type, such as ICMP, UDP, TCP, HTTP), custom probe parameters (options, such as specifying the URI for HTTP probes and custom return status codes), callback handling method (callback, supporting HTTP, Kafka, etc.), encapsulation type (encapsulate type, such as flat without encapsulation, VXLAN encapsulation, Geneve encapsulation, etc.), encapsulation ID (encapsulate ID, such as VNI), and encapsulation options (encapsulate options).
[0105] In this way, by unifying and abstracting the detection parameters, agents or independent detection modules can perform detection processing according to different detection types and environments, ensuring adaptability to various environments and improving resource utilization.
[0106] 3. Callback Manager: Used for callback processing when a state changes. It has a modular design and primarily supports HTTP, Kafka, etc.
[0107] 4. Detection Result Analyzer: Used for analysis and judgment of fault switching and recovery.
[0108] 5. Detection Result Cache Manager: Connects to various caches or time-series databases, saves reported results in shards (it can be used as an independent component), and lets agents pull cached detection configurations.
[0109] The callback manager, result analyzer, and cache manager can be integrated into separate components. The server components are independent of each other, allowing for horizontal scaling.
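As an illustration of the `node.load` calculation used by the Cluster Manager, the following sketch sums weight times frequency over the probes assigned to a node. The weights are those listed in the description; the function name `node_load` and the `(probe_type, qps)` input format are assumptions.

```python
from typing import Iterable, Tuple

# Per-protocol weights m from the description; real values would be tuned
# from CPU consumption and protocol packet volume.
PROBE_WEIGHTS = {"icmp": 1, "udp": 1, "tcp": 2, "http": 3, "ssl": 5, "http/2": 6}


def node_load(assigned: Iterable[Tuple[str, float]]) -> float:
    """Compute node.load = sum(m * n) over (probe_type, qps) pairs."""
    return sum(PROBE_WEIGHTS[probe_type] * qps for probe_type, qps in assigned)
```

For instance, a node running ICMP probes at 10 qps, HTTP probes at 2 qps, and SSL probes at 1 qps has a load of 1*10 + 3*2 + 5*1 = 21.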
[0110] The agent includes the following four modules:
[0111] 1. Probe Manager: Initiates probe requests based on probe configuration.
[0112] 2. Result Reporting: Analyzing probe results on the server side is often CPU-intensive, so horizontal scaling needs to be considered. Therefore, probe results generated or collected by the agent can be assigned to servers by consistent hashing on the health check ID and then reported to the servers in batches for unified processing.
[0113] 3. Detection Driver: Provides modular detection plugins based on the deployment environment, for example, OVS packet-in.
[0114] 4. Configuration Puller: Pulls the probe configuration from the server. It performs consistent hashing on the hostname to obtain the address of a specific server and then requests the configuration from that server, which balances load and reduces the amount of data migration after a server failure.
[0115] The probe manager could be considered as a separate component, and gRPC could be used for communication between the agent and the server to improve performance.
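The consistent hashing used by the Configuration Puller can be sketched with a simple hash ring. The description does not specify the hashing scheme, so the class name `HashRing`, the use of SHA-256, and the 100 virtual nodes per server are all assumptions.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


class HashRing:
    """Minimal consistent hash ring with virtual nodes (illustrative only)."""

    def __init__(self, servers: Iterable[str], vnodes: int = 100) -> None:
        ring: List[Tuple[int, str]] = []
        for server in servers:
            for i in range(vnodes):
                ring.append((self._hash(f"{server}#{i}"), server))
        ring.sort()
        self._ring = ring
        self._keys = [h for h, _ in ring]

    @staticmethod
    def _hash(key: str) -> int:
        # First 8 bytes of SHA-256 as an unsigned integer position on the ring.
        return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")

    def get(self, key: str) -> str:
        """Return the server owning the first ring position after the key's hash."""
        if not self._ring:
            raise ValueError("no servers on the ring")
        idx = bisect.bisect_right(self._keys, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

An agent would call `ring.get(hostname)` to choose the server it pulls from. When a server is removed from the ring, only the hostnames that mapped to that server move to another one, which is the reduced data migration mentioned above.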
[0116] Based on the same inventive concept, one embodiment of the present invention provides a health check method. Referring to Figure 6, which is a flowchart of a health check method provided in an embodiment of the present invention, the health check method of this embodiment is applied to the server side of the health check system described in any of the above embodiments. As shown in Figure 6, this health check method includes at least the following steps:
[0117] Step S31: Receive a creation request sent by the console, the creation request being used to create a health check task;
[0118] Step S32: Based on the creation request, determine the target probe deployed in the agent, and send probe confirmation information to the agent corresponding to the target probe;
[0119] Step S33: Receive the pull request sent by the agent corresponding to the target probe, obtain the probe configuration corresponding to the health check task based on the pull request, and return it to the agent, so that the agent can probe the target object through the target probe according to the probe configuration, obtain the probe result, and report the probe result to the server;
[0120] Step S34: Query the original detection results of the detection object, compare the original detection results with the detection results, and if there is a difference between the original detection results and the detection results, determine that the state of the detection object has changed.
[0121] Optionally, the method further includes:
[0122] If the state of the probed object changes, a callback is executed according to the callback handling method in the probe configuration to notify the corresponding business system.
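Step S34 and the optional callback can be sketched as follows. The handler registry, the function names, and the event format are hypothetical, and the handlers only record notifications instead of performing real HTTP or Kafka I/O. The sketch also assumes that the first result for an object is stored without triggering a callback, a case the description leaves open.

```python
from typing import Callable, Dict, List, Tuple

sent: List[Tuple[str, str, dict]] = []  # records notifications instead of doing real I/O


def http_callback(target: str, event: dict) -> None:
    sent.append(("http", target, event))


def kafka_callback(target: str, event: dict) -> None:
    sent.append(("kafka", target, event))


# Hypothetical registry keyed by the callback handling method in the probe configuration.
CALLBACKS: Dict[str, Callable[[str, dict], None]] = {
    "http": http_callback,
    "kafka": kafka_callback,
}


def handle_report(stored: Dict[int, str], hc_id: int, new_status: str, callback: dict) -> bool:
    """Compare the stored status with the reported one; call back on a change."""
    old_status = stored.get(hc_id)
    stored[hc_id] = new_status
    if old_status is None or old_status == new_status:
        return False  # first observation (assumed silent) or no change
    event = {"hc_id": hc_id, "from": old_status, "to": new_status}
    CALLBACKS[callback["method"]](callback["target"], event)
    return True
```

Keeping the handlers behind a registry mirrors the modular callback design: adding a new notification channel only requires registering another handler.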
[0123] Optionally, the method further includes:
[0124] Manage information about the probe machines, the information including at least: load information, tag information, and region information;
[0125] The step S32 above, "determining the target probe deployed in the agent," includes:
[0126] Based on minimum load information, tag information, and region information, target detectors for performing the health check task are selected according to the actual packet volume and CPU consumption.
[0127] The method further includes:
[0128] The agent that deploys the target probe is identified as the agent that obtains the probe configuration. Probe confirmation information is sent to the agent corresponding to the target probe, so that the agent sends a pull request to the server based on the probe confirmation information.
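One possible reading of the selection by minimum load, tag, and region is sketched below. The `Node` structure, the function name `select_probes`, the "least-loaded node per region" policy, and the top-up to a minimum of two probes are assumptions; the description notes that the scheduling algorithm can be adjusted as needed.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Node:
    agent_id: int
    region: str
    tags: Set[str]
    load: float  # e.g. the weighted sum of probe method and frequency


def select_probes(nodes: List[Node], required_tags: Set[str], min_count: int = 2) -> List[Node]:
    """Pick the least-loaded matching node per region, then top up to min_count."""
    eligible = [n for n in nodes if required_tags <= n.tags]
    best: Dict[str, Node] = {}
    for node in eligible:
        current = best.get(node.region)
        if current is None or node.load < current.load:
            best[node.region] = node
    chosen = sorted(best.values(), key=lambda n: n.load)
    if len(chosen) < min_count:
        # Not enough regions: fill with the next least-loaded eligible nodes.
        chosen_ids = {n.agent_id for n in chosen}
        rest = sorted((n for n in eligible if n.agent_id not in chosen_ids), key=lambda n: n.load)
        chosen += rest[: min_count - len(chosen)]
    return chosen
```

The agents hosting the returned nodes would then receive the probe confirmation information and pull their configuration.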
[0129] Optionally, the method further includes:
[0130] Interact with multiple caches or time-series databases to shard and store the detection results and / or the detection configuration in a distributed cache.
[0131] Optionally, there are multiple agents and multiple servers, with the multiple servers deployed in different regions, and the multiple agents in each region deployed in physical machines, virtual machines and / or containers in that region;
[0132] The system also includes: an external server, which at least includes: an intelligent operation and maintenance system;
[0133] The method further includes:
[0134] The detection results are mirrored, and the mirrored detection results are sent to the intelligent operation and maintenance system, so that the intelligent operation and maintenance system receives the mirrored detection results sent by the server, and performs performance and quality analysis on the detected object based on the mirrored detection results to obtain analysis results.
[0135] Optionally, the method further includes:
[0136] Receive the pull request and determine the agentID based on the hostname in the pull request;
[0137] The agentID is looked up in the health check association table to obtain the health check ID; the health check association table includes the mapping relationship between agentIDs and health check IDs.
[0138] The health check ID is looked up in the health check table to obtain the corresponding detection configuration, which the server returns to the agent; the health check table includes the mapping relationship between health check IDs and detection configurations.
[0139] The detection configuration includes at least one of the following: detection frequency, health threshold, anomaly threshold, and detection type.
[0140] Based on the same inventive concept, one embodiment of the present invention provides a health check method. Referring to Figure 7, which is a flowchart of a health check method provided in an embodiment of the present invention, the health check method of this embodiment is applied to the agent in the health check system described in any of the above embodiments. As shown in Figure 7, this health check method includes at least the following steps:
[0141] Step S41: Receive the probe confirmation message sent by the server; the probe confirmation message is generated by the server after receiving the creation request sent by the console and determining the target probe deployed in the agent based on the creation request, and the creation request is used to create a health check task;
[0142] Step S42: Send a pull request to the server to obtain the detection configuration corresponding to the health check task returned by the server;
[0143] Step S43: According to the detection configuration, the target detector is used to detect the target object, and the detection result is obtained. The detection result is then reported to the server so that the server can query the original detection result of the target object, compare the original detection result with the detection result, and determine that the state of the target object has changed if there is a difference between the original detection result and the detection result.
[0144] Optionally, there are multiple servers, and step S42 above may include:
[0145] Perform a consistent hash on the hostname to determine the first server among multiple servers;
[0146] The pull request is sent to the first server based on the address of the first server to obtain the probe configuration in the first server.
[0147] Optionally, there may be multiple servers, and the step S43 above, "reporting the detection results to the servers," may include:
[0148] Perform a consistent hash on the health check ID of the health check task to determine the second server among multiple servers;
[0149] The detection results are reported to the second server.
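The batching step on the agent side, in which results are grouped by the server that owns each health check ID, might look like the sketch below. The function name `group_results_by_server` and the result format are assumptions. The `pick_server` argument stands for whatever consistent hash lookup the agent uses; the modulo-based lambda in the usage note is only a placeholder and is not a consistent hash.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def group_results_by_server(
    results: List[dict], pick_server: Callable[[int], str]
) -> Dict[str, List[dict]]:
    """Turn hc_id -> server lookups into server -> [results] batches."""
    batches: Dict[str, List[dict]] = defaultdict(list)
    for result in results:
        batches[pick_server(result["hc_id"])].append(result)
    return dict(batches)
```

With two hypothetical servers and `pick_server = lambda hc_id: servers[hc_id % 2]`, results for health checks 1 and 3 end up in one batch and the result for health check 2 in another, so the agent sends one report per server instead of one per result.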
[0150] Figure 8 is an overall flowchart of a health check method proposed in one embodiment of the present invention. As shown in Figure 8, this example uses website monitoring as a case study for health checks. The processing flow is similar for other business scenarios such as load balancing, ECMP, dynamic DNS, and uptime monitoring services. This health check method is applied to a health check system, which at least includes: a console, a server, an agent, and an external server. The health check includes the following steps:
[0151] Step 100, Add and Register Probes: Here, "node" can be a probe. Deployed probes can be added via the console or automatically registered by the agent. When adding, set the region and label according to the actual situation for easy scheduling. The load of the created node is 0 by default. Repeat this step as needed until there are enough probes.
[0152] Step 101, Create a health check: The console creates a health check by making an HTTP request to the server.
[0153] Step 102, Select probe nodes by load: The server selects suitable probe nodes based on minimum load and label for each region. The selection strategy should be distributed across regions as much as possible, with one probe node per region. Filtering is done based on health check type and tenant information, ensuring at least two probe nodes are available. The scheduling algorithm can be adjusted as needed. After writing to the persistent database, the process returns.
[0154] Step 103: Pull health check configuration by hostname: The server can notify the agent via gRPC so that the agent learns of business changes quickly. Alternatively, the agent's scheduled task can pull the configuration via gRPC. The agent selects a server using a consistent hash algorithm to ensure even request distribution. The server list can be obtained through an additional service discovery system or DNS. After receiving the pull request, the server queries the node based on the hostname parameter, associates it with NodeHcAssociation, then associates it with healthcheck, and retrieves the health check configuration to be returned.
[0155] Step 104: Report results by consistent hashing of the health check ID (report result by consistent hashing hc ID): After obtaining the configuration, the agent can perform probes using a coroutine or process within the same process, or by using an additional program, and then collect detailed probe results. The probe frequency is generally once every 1-60 minutes. A consistent hash is computed on the health check ID of each collected probe result. The server list is obtained through an additional service discovery system or DNS, similar to step 103. After calculating the mapping hc_id->server_ip, the results are aggregated into the mapping server_ip->[]{hc_id} and reported to the servers in batches.
[0156] Step 105: Analyze the result based on the health check status and result: The server queries the stored status of the probed object and compares it with the probe results. If there is a difference, a status judgment is triggered. For example, if the original probe status was active and the current probe result is abnormal (packet loss or keyword not found), the status can be judged by the ratio of abnormal or normal probes to the total number of effective probes.
[0157] Step 106, Callback if the status is changed: If the status of the target changes, a callback is executed based on the callback content to notify the corresponding business system (i.e., the external server in Figure 8), for example, through email alerts or traffic switching. In addition, the health check system can add other dimensions of observable monitoring to facilitate troubleshooting.
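Step 105 leaves the exact judgment method open. One possibility, sketched here under assumptions, applies the consecutive-count healthy and unhealthy thresholds from the probe configuration; the class name `StatusTracker` and its interface are hypothetical, and a ratio-based judgment would be an equally valid alternative.

```python
class StatusTracker:
    """Flip status only after enough consecutive contradicting probe results."""

    def __init__(self, healthy_threshold: int, unhealthy_threshold: int,
                 status: str = "healthy") -> None:
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.status = status
        self._streak = 0  # consecutive results that contradict the current status

    def observe(self, ok: bool) -> bool:
        """Record one probe result; return True if the status changed."""
        if ok == (self.status == "healthy"):
            self._streak = 0  # result agrees with the current status
            return False
        self._streak += 1
        threshold = (self.unhealthy_threshold if self.status == "healthy"
                     else self.healthy_threshold)
        if self._streak >= threshold:
            self.status = "unhealthy" if self.status == "healthy" else "healthy"
            self._streak = 0
            return True
        return False
```

A `True` return value is the point at which the callback of step 106 would be triggered; isolated failures below the threshold do not change the status.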
[0158] The health check method provided in this embodiment has advantages including at least the following:
[0159] 1. With unified model abstraction and pluggable agent detection methods, it can be applied to most scenarios, such as load balancing, ECMP, and website liveness status. It also supports centralized management.
[0160] 2. With support for custom callbacks, it can be integrated with various business or notification systems. As long as the health check system remains unchanged, stability issues will not be introduced.
[0161] 3. By using data sharding, consistent hashing, and more granular traffic scheduling calculations, horizontal scaling can be achieved, thereby supporting larger-scale business operations.
[0162] This will at least achieve: 1. Unified management, reducing maintenance costs; 2. Productization, providing the ability to perform standard health checks; 3. Undertaking large-scale business.
[0163] In other words, this embodiment improves the system's horizontal and business scalability through data sharding, unified model abstraction, and support for custom callbacks; it schedules probe machines based on actual packet volume and CPU consumption, resulting in more balanced probe traffic and load; it reports probe results after consistent hashing, which makes the servers' CPU load more balanced; the agent supports modular design, making it easy to extend to various standard probe methods; the probe model is uniformly abstracted, making it easy to maintain; and it supports custom callbacks, making it convenient to integrate with various business systems.
[0164] It should be noted that, for the sake of simplicity, the method embodiments are all described as a series of actions. However, those skilled in the art should understand that the embodiments of the present invention are not limited to the described order of actions, because according to the embodiments of the present invention, some steps can be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions involved are not necessarily essential to the embodiments of the present invention.
[0165] As the method embodiments are basically similar to the system embodiments, the description is relatively simple, and relevant parts can be found in the description of the system embodiments.
[0166] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on the differences from other embodiments. The same or similar parts between the various embodiments can be referred to each other.
[0167] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, apparatus, or computer program products. Therefore, embodiments of the present invention can take the form of entirely hardware embodiments, entirely software embodiments, or embodiments combining software and hardware aspects. Furthermore, embodiments of the present invention can take the form of computer program products implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0168] This invention is described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0169] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0170] These computer program instructions can also be loaded onto a computer or other programmable data processing terminal equipment, causing a series of operational steps to be performed on the computer or other programmable terminal equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable terminal equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0171] Although preferred embodiments of the present invention have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments as well as all changes and modifications falling within the scope of the embodiments of the present invention.
[0172] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or terminal device that includes said element.
[0173] The above provides a detailed description of the health check system and method provided by the present invention. Specific examples have been used to illustrate the principles and implementation methods of the present invention. The description of the above embodiments is only for the purpose of helping to understand the method and core ideas of the present invention. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of the present invention. Therefore, the content of this specification should not be construed as a limitation of the present invention.
Claims
1. A health check system, characterized in that the system includes at least: a console, multiple agents, and multiple servers, with the multiple servers deployed in different regions, and the multiple agents in each region deployed in physical machines, virtual machines and/or containers in that region; the console is used to send a creation request to the server, and the creation request is used to create a health check task; the server is used to determine the target probe deployed in the agent based on the creation request; the agent is used to send a pull request to the server to obtain the probe configuration corresponding to the health check task in the server; the agent is used to detect the target object through the target probe according to the probe configuration, obtain the detection result, and report the detection result to the server; the server is used to query the original detection results of the detected object, compare the original detection results with the detection results, and determine that the state of the detected object has changed if there is a difference between the original detection results and the detection results; wherein sending a pull request to the server to obtain the probe configuration corresponding to the health check task in the server includes: performing a consistent hash on the hostname to determine the first server among the multiple servers; and sending the pull request to the first server based on the address of the first server to obtain the probe configuration in the first server.
2. The health check system according to claim 1, characterized in that, The server is also used for: If the state of the probed object changes, a callback is executed according to the callback handling method in the probe configuration to notify the corresponding business system.
3. The health check system according to claim 1, characterized in that, There are multiple servers, and the detection results are reported to the servers, including: Perform a consistent hash on the health check ID of the health check task to determine the second server among multiple servers; The detection results are reported to the second server.
4. The health check system according to claim 1, characterized in that, The server is also used to manage the information of the detector, which includes at least: load information, tag information, and area information; Determining the target probe deployed in the agent includes: Based on minimum load information, tag information, and region information, target detectors for performing the health check task are selected according to the actual packet volume and CPU consumption. The server is also used to identify the agent that deploys the target detector as the agent that obtains the detection configuration, and send detector confirmation information to the agent corresponding to the target detector; The agent is specifically used to send a pull request to the server based on the confirmation information from the probe.
5. The health check system according to claim 1, characterized in that, The server is also used for: Interact with multiple caches or time-series databases to shard and store the detection results and / or the detection configuration in a distributed cache.
6. The health check system according to claim 1, characterized in that, The system further includes: an external server, the external server comprising at least: The intelligent operation and maintenance system is used to receive the detection results after mirroring sent by the server, and to perform performance and quality analysis on the detected object based on the detection results to obtain the analysis results.
7. The health check system according to any one of claims 1 to 6, characterized in that, The server is also used for: Receive the pull request and determine the agentID based on the hostname in the pull request; Look up the agentID in the health check association table to obtain the health check ID, the health check association table including the mapping relationship between agentIDs and health check IDs; Look up the health check ID in the health check table to obtain the detection configuration corresponding to the health check ID, and return the detection configuration to the agent, the health check table including the mapping relationship between health check IDs and detection configurations; The detection configuration includes at least one of the following: detection frequency, health threshold, anomaly threshold, and detection type.
8. A health examination method, characterized in that, The method, applied to the server side of the health check system according to any one of claims 1 to 7, comprises: Receive a creation request sent from the console, the creation request being used to create a health check task; Based on the creation request, the target probe machine deployed in the agent is determined, and probe machine confirmation information is sent to the agent corresponding to the target probe machine; The system receives and retrieves the detection configuration corresponding to the health check task based on the pull request sent by the agent corresponding to the target detector, and returns it to the agent so that the agent can detect the target object through the target detector according to the detection configuration, obtain the detection result, and report the detection result to the server. The original detection results of the object are queried, and the original detection results are compared with the detection results. If there is a difference between the original detection results and the detection results, it is determined that the state of the object has changed.
9. A health examination method, characterized in that, The method, applied to the agent in the health check system according to any one of claims 1 to 7, comprises: Receive a probe confirmation message sent by the server, the probe confirmation message being generated by the server after receiving a creation request from the console and determining the target probe deployed in the agent based on the creation request, and the creation request being used to create a health check task; Send a pull request to the server to obtain the detection configuration corresponding to the health check task returned by the server; According to the detection configuration, detect the target object through the target detector, obtain the detection result, and report the detection result to the server so that the server can query the original detection result of the target object, compare the original detection result with the detection result, and determine that the state of the target object has changed if there is a difference between the original detection result and the detection result.