Abnormality detection method and device, electronic device, and storage medium
By applying Shewhart quality control charts to continuous integration tasks, dividing tasks into subgroups and performing anomaly detection, the issues of accuracy and response speed in CI quality monitoring were resolved, achieving efficient anomaly identification and rapid response.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- MOORE THREADS TECH CO LTD
- Filing Date
- 2026-04-20
- Publication Date
- 2026-06-19
AI Technical Summary
Existing CI quality monitoring methods have poor anomaly identification accuracy, resulting in high false alarm rates, delayed anomaly response, poor CI benchmark adaptability, and a lack of closed-loop mechanisms, failing to meet enterprises' needs for rapid response and accurate identification of continuous integration tasks.
By combining Shewhart quality control charts, continuous integration tasks within the detection cycle are divided into task subgroups based on task completion time and task indicators. Anomalies are detected using control chart baseline values and anomaly detection conditions, and response processing corresponding to the anomaly level is executed when anomalies are found.
It improves the accuracy of anomaly identification, reduces the false alarm rate, realizes real-time detection and process assurance of CI anomalies, adapts to baseline deviations caused by code iteration, and reduces delays caused by manual intervention.
Figure CN122240502A_ABST
Abstract
Description
Technical Field
[0001] This disclosure relates to the field of computer software engineering technology, and in particular to an anomaly detection method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
Background Technology
[0002] In the field of computer software engineering, developers in enterprises frequently integrate code into shared code repositories. Each integration involves automated task building and testing verification through Continuous Integration (CI) tasks.
[0003] In systems with multiple code repositories, multiple code version branches, and frequent code submissions, CI quality monitoring is an embedded quality assurance component in the process. This component is required to quickly and accurately identify quality risks during code integration and prevent code defects from flowing into subsequent stages.
[0004] However, the accuracy of anomaly identification in CI quality monitoring methods in related technologies is generally poor.
Summary of the Invention
[0005] This disclosure provides an anomaly detection method and apparatus, electronic device, computer-readable storage medium, and computer program product.
[0006] In a first aspect, this disclosure provides an anomaly detection method, comprising: acquiring task information of continuous integration tasks in a continuous integration system, the task information including at least one of task request time, task dimension label, task completion time, task build result, task build duration, task test result, and task queuing duration; dividing continuous integration tasks with the same task dimension label and task completion time within a first detection period into at least one task subgroup based on the task information of the continuous integration tasks and the target task indicator to be analyzed; for any task subgroup, determining the indicator value of the task subgroup for the target task indicator based on the task information of the continuous integration tasks within the task subgroup; performing anomaly detection on the indicator value of the task subgroup based on the control chart baseline value of the target task indicator within the first detection period and preset anomaly detection conditions, and obtaining an anomaly detection result for the first detection period; the anomaly detection result is used to indicate whether an anomaly exists and the anomaly level when an anomaly exists; and, if the anomaly detection result indicates that an anomaly exists, performing an anomaly response processing corresponding to the anomaly level of the anomaly detection result.
[0007] In a second aspect, this disclosure provides an anomaly detection device, which includes:
[0008] The information acquisition module is used to acquire task information of continuous integration tasks in the continuous integration system. The task information includes at least one of the following: task request time, task dimension label, task completion time, task build result, task build duration, task test result, and task queuing duration.
[0009] The task grouping module is used to divide continuous integration tasks with the same task dimension labels and task completion time within the first detection period into at least one task subgroup based on the task information of the continuous integration tasks and the target task indicators to be analyzed.
[0010] The indicator value determination module is used to determine the indicator value of the task subgroup for the target task indicator based on the task information of the continuous integration tasks within the task subgroup.
[0011] An anomaly detection module is used to perform anomaly detection on the indicator values of the task subgroups based on the control chart baseline value of the target task indicator in the first detection period and the preset anomaly detection conditions, and obtain the anomaly detection result of the first detection period; the anomaly detection result is used to indicate whether an anomaly exists and the anomaly level when an anomaly exists.
[0012] An anomaly response module is used to perform anomaly response processing corresponding to the anomaly level of the anomaly detection result when the anomaly detection result indicates that an anomaly exists.
[0013] In a third aspect, this disclosure provides an electronic device comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores one or more computer programs executable by the at least one processor, the one or more computer programs being executed by the at least one processor to enable the at least one processor to perform the above-described anomaly detection method.
[0014] In a fourth aspect, this disclosure provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the above-described anomaly detection method.
[0015] In a fifth aspect, this disclosure provides a computer program product that includes computer-readable code or a non-volatile computer-readable storage medium carrying computer-readable code, wherein when the computer-readable code is run in a processor of an electronic device, the processor in the electronic device performs the above-described anomaly detection method.
[0016] The embodiments provided in this disclosure can obtain task information of continuous integration tasks in the system, divide the continuous integration tasks within the detection period into task subgroups according to the task completion time and task indicators, determine the indicator values of the task subgroups according to the task information, perform anomaly detection on the indicator values of the task subgroups according to the control chart baseline value and anomaly detection conditions within the detection period to obtain anomaly detection results, and perform anomaly response processing corresponding to the anomaly level when anomalies exist, thereby improving the accuracy of anomaly identification.
[0017] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.
Brief Description of the Drawings
[0018] The accompanying drawings are provided to further illustrate the present disclosure and form part of the specification. They are used together with the embodiments of the present disclosure to explain the disclosure and do not constitute a limitation thereof. The above and other features and advantages will become more apparent to those skilled in the art from the detailed description of exemplary embodiments with reference to the accompanying drawings, in which:
[0019] Figure 1 is a flowchart of an anomaly detection method provided in an embodiment of this disclosure.
[0020] Figure 2 is a schematic diagram of the module architecture of an anomaly detection system corresponding to an anomaly detection method provided in an embodiment of this disclosure.
[0021] Figure 3 is a schematic diagram of the processing flow of an anomaly detection method provided in an embodiment of this disclosure.
[0022] Figure 4 is a block diagram of an anomaly detection device provided in an embodiment of this disclosure.
[0023] Figure 5 is a block diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
[0024] To enable those skilled in the art to better understand the technical solutions of this disclosure, exemplary embodiments of this disclosure are described below with reference to the accompanying drawings, including various details of the embodiments of this disclosure to aid understanding. These should be considered merely exemplary. Therefore, those skilled in the art should recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of this disclosure. Similarly, for clarity and conciseness, descriptions of well-known functions and structures are omitted in the following description.
[0025] Where there is no conflict, the various embodiments of this disclosure and the features thereof in the embodiments may be combined with each other.
[0026] As used herein, the term “and / or” includes any and all combinations of one or more related enumerated entries.
[0027] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit this disclosure. As used herein, the singular forms “a” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the terms “comprising” and / or “made of”, when used in this specification, specify the presence of the stated features, integers, steps, operations, elements, and / or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and / or groups thereof. Words such as “connected” or “linked” are not limited to physical or mechanical connections but can include electrical connections, whether direct or indirect.
[0028] Unless otherwise specified, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art. It will also be understood that terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the relevant art and this disclosure, and will not be interpreted as having an idealized or overly formal meaning, unless expressly so defined herein.
[0029] As mentioned above, when a system has multiple code repositories, multiple code version branches, and frequent code submissions, the corresponding system, such as a K8S test cluster, must complete a large number of continuous integration tasks every day, including compilation tasks and testing tasks.
[0030] However, CI quality monitoring methods in related technologies typically have the following drawbacks:
[0031] Inaccurate CI metric monitoring: CI systems in related technologies often rely on fixed thresholds, such as "alarm when task build time > 60 minutes", and do not match dedicated control charts to data types. For example, the same statistical method is applied to both build success rate (discrete) and build time (continuous), resulting in a high false alarm rate, such as more than 30%.
[0032] Delayed response to build anomalies: CI systems in related technologies only trigger general alerts after anomalies occur and lack CI-specific response mechanisms such as build retries or pausing branch commits. This leads to a backlog of failed builds that consumes system resources such as GPU, CPU, and memory.
[0033] Poor CI baseline adaptability: after code iteration (such as driver updates or new features added to the operator library), the average build time shifts, but the control chart baseline (center line / control limits) in related technologies cannot be dynamically adjusted, so the anomaly detection results become invalid.
[0034] Disconnected CI process and monitoring: related technologies lack a closed loop of "build data acquisition - control chart calculation - anomaly response", so analysis requires manual intervention and incurs a long delay, such as more than 6 hours.
[0035] It is evident that the CI quality monitoring methods in related technologies cannot meet the needs of enterprises for rapid response, accurate identification, and quick resolution of anomalies in continuous integration tasks.
[0036] Based on this, embodiments of the present disclosure provide an anomaly detection method for continuous integration tasks, which combines CI quality inspection with Shewhart quality control charts from the field of statistical quality control. The continuous integration tasks within the detection period are divided into task subgroups based on task completion time and task indicators. Indicator values for each task subgroup are determined based on task information. Anomaly detection results are obtained by performing anomaly detection on the indicator values of the task subgroups based on the control chart baseline values and anomaly detection conditions within the detection period. Furthermore, when anomalies are detected, anomaly response processing corresponding to the anomaly level is executed, thereby improving the accuracy of anomaly identification.
[0037] Embodiments of this disclosure focus on core quality indicators in the CI process, such as build success rate and build time, to achieve real-time detection of CI anomalies and safeguard the build process. The anomaly detection method according to embodiments of this disclosure can be applied to a continuous integration system and executed by an electronic device within the continuous integration system, such as a terminal device or a server.
[0038] Figure 1 is a flowchart of an anomaly detection method provided in an embodiment of this disclosure. Referring to Figure 1, the method includes:
[0039] In step S11, the continuous integration tasks whose completion time is within the first detection period are divided into at least one task subgroup.
[0040] In step S12, for any task subgroup in the at least one task subgroup, the indicator value of the task subgroup for the target task indicator is determined according to the task information of the continuous integration task in the task subgroup.
[0041] In step S13, based on the control chart baseline value of the target task indicator within the first detection period and the preset anomaly detection conditions, anomaly detection is performed on the indicator value of the task subgroup to obtain the anomaly detection result of the first detection period; the anomaly detection result is used to indicate whether there is an anomaly and the anomaly level when an anomaly exists.
[0042] In step S14, if the anomaly detection result indicates that an anomaly exists, an anomaly response process corresponding to the anomaly level of the anomaly detection result is executed.
[0043] For example, the task information of the continuous integration tasks in the continuous integration system may first be acquired. The task information includes at least one of the following: task request time, task dimension label, task completion time, task build result, task build duration, task test result, and task queuing duration.
[0044] In some possible implementations, the continuous integration system can be a CI cluster, such as a Kubernetes (K8S) cluster. A CI cluster includes various cluster resources, such as GPUs, CPUs, and memory. Information such as cluster resource utilization and CI task submission requests can be obtained through various components and / or tools. Based on this information, the cluster availability status and CI task queuing status can be determined.
[0045] Components and / or tools include, for example, Kubernetes (an open-source container orchestration platform), containerd (for managing the lifecycle of containers on a single node), GPU Operator (a component for configuring and managing GPU resources in a cluster), and Jenkins (a CI / CD tool for automating continuous delivery). It should be understood that those skilled in the art can configure the components and / or tools and specific data acquisition methods according to actual circumstances, and this disclosure does not impose any restrictions in this regard.
[0046] In some possible implementations, continuous integration tasks may include code compilation tasks and functional testing tasks. A compilation task is considered complete once the corresponding code has been compiled successfully; a testing task is considered complete once the corresponding code has been compiled successfully, the corresponding functional tests have been run, and the test results have been obtained.
[0047] In some possible implementations, the task dimension labels of continuous integration tasks are used to identify the continuous integration tasks and include at least one of a code repository identifier, a code version branch identifier, and a hardware resource type identifier, for example task dimension labels such as AI operator library A0, main branch, and GPU-TYPE1. It should be understood that those skilled in the art can set the task dimension labels of continuous integration tasks according to actual circumstances, and this disclosure does not impose any restrictions on this.
[0048] In some possible implementations, data from continuous integration tasks can be continuously collected through various components and / or tools to obtain initial task information, such as task build results (success / failure), task build duration, task test results (success / failure), and task queuing time. These components and / or tools include CI / CD tools like Jenkins, time-series databases like InfluxDB (a high-performance time-series database), and CI log parsers (tools / components specifically for processing CI process logs). It should be understood that those skilled in the art can configure the components and / or tools and specific data acquisition methods according to actual circumstances, and this disclosure does not impose any limitations in this regard.
[0049] In some possible implementations, after collecting the initial task information of the continuous integration task, the initial task information can be standardized, including task completion time, task dimension labels, task metric values, etc.; and the continuous integration task data can be cleaned to obtain a valid dataset of continuous integration tasks, which includes the task identifier ID and task information for each continuous integration task. The task information includes at least one of the following: task request time, task dimension labels, task build result, task build duration, task test result, and task queuing time. This disclosure does not limit the specific processing methods for task information standardization and data cleaning.
[0050] Here, the task completion time characterizes the time point at which the task is completed, that is, the time point at which a code compilation task is successfully compiled, or the time point at which a functional testing task obtains its task test result; the task dimension label identifies the continuous integration task and includes at least one of the following: code repository label, code version branch label, and hardware resource type label; the task indicator value characterizes the value of the corresponding target task indicator. For example, if the target task indicator is the task build success rate, then the task indicator value is the probability value of the continuous integration task being successfully built.
[0051] The task request time characterizes the time point at which the continuous integration task initiates the execution request; the task build result characterizes the execution result of the continuous integration task, either success or failure; the task build duration characterizes the duration between the start of execution and the completion of execution of the continuous integration task; the task test result characterizes the test result of the functional testing task in the continuous integration task, either success or failure; and the task queuing duration characterizes the duration between the initiation of the execution request and the start of execution of the continuous integration task.
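The task information fields described above can be pictured as one record per task. The following is a minimal sketch only: the class name `CITask`, the field names, and the choice to derive the two durations from three timestamps are illustrative assumptions, not structures defined by this disclosure.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class CITask:
    """One cleaned CI task record (hypothetical field names)."""
    task_id: str
    dimension_label: tuple      # e.g. (repository, branch, hardware type)
    request_time: datetime      # when the execution request was initiated
    start_time: datetime        # when execution actually started (assumed field)
    completion_time: datetime   # when the build or test result was obtained
    build_succeeded: bool
    test_succeeded: Optional[bool] = None  # None for compile-only tasks

    @property
    def queuing_duration(self) -> timedelta:
        # Waiting time between the request and the start of execution.
        return self.start_time - self.request_time

    @property
    def build_duration(self) -> timedelta:
        # Time between the start and the completion of execution.
        return self.completion_time - self.start_time
```

Deriving the durations from timestamps keeps the record consistent by construction; a real collector might instead store the durations reported by the CI tool.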
[0052] In some possible implementations, in step S11, continuous integration tasks with completion times within the first detection period are divided into at least one task subgroup. Step S11 includes: based on the task information of the continuous integration tasks in the continuous integration system and the target task metrics to be analyzed, dividing continuous integration tasks with the same task dimension labels and completion times within the first detection period into at least one task subgroup.
[0053] In some possible implementations, the target task metrics to be analyzed include a discrete first task metric and a continuous second task metric. The first task metric includes at least one of task build success rate, task build retry success rate, and task test pass rate. The second task metric includes at least one of average build time and average queuing time.
[0054] Here, the task build success rate characterizes the probability of a continuous integration task being successfully built; the task build retry success rate characterizes the probability of a successful retry after a continuous integration task fails to build; the task test pass rate characterizes the probability that a functional testing task passes its tests; the average build time characterizes the average time between the start of a continuous integration task and its completion; and the average queuing time characterizes the average time between the submission of a request and the start of a continuous integration task.
[0055] It should be understood that the first and second task indicators are not limited to these. Those skilled in the art can set various types of first and second task indicators according to the actual situation, and this disclosure does not impose any restrictions on this.
[0056] By dividing continuous integration tasks into task subgroups, the "subgroup mean" rather than a "single data point" can be used as input in subsequent processing. This smooths out random fluctuations, reflects the true trend of the current period, prevents the baseline from being distorted by a single abnormal data point, improves the accuracy of anomaly identification, and reduces the probability of false alarms.
[0057] In some possible implementations, the number of tasks in a task subgroup differs between the discrete first task indicator and the continuous second task indicator. Specifically, for the discrete first task indicator, the number of tasks in the task subgroup can be set higher, such as 100, to improve the accuracy of anomaly identification and better suit high-frequency build scenarios. For the continuous second task indicator, the number of tasks in the task subgroup can be set lower, such as 5, to improve the speed of anomaly detection. It should be understood that those skilled in the art can set the number of tasks in the task subgroups of the first and second task indicators according to actual circumstances, and this disclosure does not impose any limitations on this.
[0058] In some possible implementations, the differences between continuous integration tasks with different task dimension labels may be significant. Continuous integration tasks with the same task dimension labels can be analyzed to improve the accuracy of anomaly identification.
[0059] In some possible implementations, a detection period can be set. This detection period can be a set time period, such as 12:00-14:00, or 2 hours; or it can be a set number of task subgroups, such as defining one or more task subgroups as a detection period. This disclosure does not limit this.
[0060] In step S11, the current detection period can be recorded as the first detection period. The number of tasks in the task subgroup is determined according to the target task indicators. Continuous integration tasks with the same task dimension label and task completion time within the first detection period can be divided into at least one task subgroup so that subsequent analysis can be performed on a task subgroup basis, thereby improving the accuracy of anomaly identification.
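One way to picture step S11 is sketched below. It assumes each task is reduced to a `(dimension_label, completion_time, payload)` tuple, that subgroups are filled in completion-time order, and that trailing tasks that do not fill a whole subgroup are left out; the function name and these choices are assumptions for illustration rather than the grouping prescribed by this disclosure.

```python
from collections import defaultdict
from datetime import datetime


def split_into_subgroups(tasks, period_start, period_end, subgroup_size):
    """Group tasks by dimension label, then chunk them into fixed-size subgroups.

    tasks: iterable of (dimension_label, completion_time, payload) tuples.
    Returns {dimension_label: [subgroup, ...]}; each subgroup is a list of payloads.
    """
    by_label = defaultdict(list)
    for label, done_at, payload in tasks:
        # Keep only tasks completed inside the detection period.
        if period_start <= done_at < period_end:
            by_label[label].append((done_at, payload))

    subgroups = {}
    for label, items in by_label.items():
        items.sort(key=lambda pair: pair[0])  # order by completion time
        full = len(items) - len(items) % subgroup_size
        subgroups[label] = [
            [payload for _, payload in items[i:i + subgroup_size]]
            for i in range(0, full, subgroup_size)
        ]
    return subgroups
```

With a subgroup size of 5 and eleven tasks for one label in the period, this yields two subgroups and holds back the eleventh task.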
[0061] For any task subgroup, in step S12, the indicator value for the target task metric is determined based on the task information of the continuous integration tasks within that subgroup. The method for determining the indicator value differs for different target task indicators. For discrete first task indicators, such as task build success rate, the corresponding indicator value can be determined based on the ratio of the number of successfully built tasks in the subgroup to the total number of tasks in the subgroup. For continuous second task indicators, such as average build time, the corresponding indicator value can be determined based on the average build time of the continuous integration tasks within the subgroup.
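The two kinds of indicator value in step S12 can be sketched as follows. The function names are hypothetical, the success rate is returned as a fraction in [0, 1] rather than a percentage, and build times are assumed to be given in minutes.

```python
from statistics import mean


def build_success_rate(build_results):
    """Discrete indicator: share of successful builds in one task subgroup."""
    return sum(1 for ok in build_results if ok) / len(build_results)


def average_build_time(build_minutes):
    """Continuous indicator: mean build duration of one task subgroup."""
    return mean(build_minutes)
```

For a subgroup of 100 tasks with 97 successful builds, `build_success_rate` returns 0.97; for build times of 10, 12, 14, 16, and 18 minutes, `average_build_time` returns 14.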
[0062] In some possible implementations, in step S13, based on the control chart baseline value of the target task indicator in the first detection period and the preset anomaly detection conditions, anomaly detection is performed on the indicator value of the task subgroup to obtain the anomaly detection result of the first detection period; the anomaly detection result is used to indicate whether an anomaly exists and the anomaly level when an anomaly exists.
[0063] The control chart baseline value is the central reference standard value of the control chart, used to characterize the target mean or baseline level of the indicator under normal steady state. The control chart baseline value of the target task indicator in the first detection period can be a fixed baseline value, or it can be determined based on the control chart baseline value of the previous detection period and the indicator values of the task subgroups for the target task indicator in the first detection period, so as to realize the dynamic updating of the control chart baseline value and improve the accuracy of anomaly detection.
[0064] In some possible implementations, the discrete first task indicator can be monitored with a p-control chart, whose control chart baseline includes a center line, an upper control limit, and a lower control limit; the continuous second task indicator can be monitored with an X̄-R (mean and range) control chart, whose control chart baseline includes the center line, upper control limit, and lower control limit of both the mean chart and the range chart.
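For the continuous indicator, the baseline of a mean and range chart can be sketched with the textbook Shewhart formulas, in which the limits are the grand mean plus or minus A2 times the mean range, and D3 and D4 times the mean range. This is an illustration using the standard constants for subgroup sizes 2 to 5, not necessarily the exact computation of this disclosure; the function name is hypothetical.

```python
from statistics import mean

# Standard Shewhart constants (A2, D3, D4) for subgroup sizes 2 to 5.
XBAR_R_CONSTANTS = {2: (1.880, 0.0, 3.267), 3: (1.023, 0.0, 2.574),
                    4: (0.729, 0.0, 2.282), 5: (0.577, 0.0, 2.114)}


def xbar_r_limits(subgroups):
    """Return ((LCL, CL, UCL) of the mean chart, (LCL, CL, UCL) of the range chart)."""
    n = len(subgroups[0])
    if any(len(g) != n for g in subgroups):
        raise ValueError("all subgroups must have the same size")
    a2, d3, d4 = XBAR_R_CONSTANTS[n]
    grand_mean = mean(mean(g) for g in subgroups)          # center line of the mean chart
    mean_range = mean(max(g) - min(g) for g in subgroups)  # center line of the range chart
    xbar_chart = (grand_mean - a2 * mean_range, grand_mean, grand_mean + a2 * mean_range)
    r_chart = (d3 * mean_range, mean_range, d4 * mean_range)
    return xbar_chart, r_chart
```

With the subgroup size of 5 mentioned earlier for continuous indicators, the constants are A2 = 0.577, D3 = 0, and D4 = 2.114.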
[0065] In some possible implementations, one or more anomaly detection conditions can be set, including at least one of the following: the indicator value of a task subgroup exceeds a control limit; the indicator values of M1 consecutive task subgroups are on the same side of the center line; the indicator values of M2 consecutive task subgroups are monotonically increasing or monotonically decreasing; and at least M4 of the indicator values of M3 consecutive task subgroups exceed twice the standard deviation range on the same side of the center line. M1, M2, M3, and M4 are integers greater than 1, and M4 < M3.
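The anomaly detection conditions can be sketched as run rules over the most recent subgroup values. In this sketch the last condition is read as "at least M4 of M3 consecutive values lie beyond two standard deviations on one side", the standard deviation is taken as one third of the distance from the center line to the upper control limit, and the default values of M1 to M4 and the rule names are illustrative assumptions.

```python
def triggered_rules(values, lcl, center, ucl, m1=9, m2=6, m3=3, m4=2):
    """Return the names of the rules triggered by the latest subgroup values.

    values: indicator values of consecutive task subgroups, oldest first.
    """
    rules = set()
    sigma = (ucl - center) / 3.0  # assumes 3-sigma control limits

    # Rule 1: the latest value lies outside the control limits.
    if values[-1] > ucl or values[-1] < lcl:
        rules.add("beyond_limits")

    # Rule 2: the last M1 values are all on the same side of the center line.
    tail = values[-m1:]
    if len(tail) == m1 and (all(v > center for v in tail)
                            or all(v < center for v in tail)):
        rules.add("same_side_run")

    # Rule 3: the last M2 values are strictly increasing or strictly decreasing.
    tail = values[-m2:]
    if len(tail) == m2:
        steps = [b - a for a, b in zip(tail, tail[1:])]
        if all(s > 0 for s in steps) or all(s < 0 for s in steps):
            rules.add("monotonic_trend")

    # Rule 4: at least M4 of the last M3 values are beyond 2 sigma on one side.
    tail = values[-m3:]
    if len(tail) == m3:
        above = sum(v > center + 2 * sigma for v in tail)
        below = sum(v < center - 2 * sigma for v in tail)
        if above >= m4 or below >= m4:
            rules.add("two_sigma_cluster")

    return rules
```

Returning the set of triggered rule names, rather than a single boolean, leaves room for weighting the rules differently when an anomaly level is derived.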
[0066] In some possible implementations, anomaly detection is performed on the indicator values of task subgroups based on control chart baseline values and anomaly detection conditions. This identifies task subgroups that trigger one or more anomaly detection conditions. The anomaly level is then determined based on the risk weights of the anomaly detection conditions triggered by the task subgroups, yielding the anomaly detection results for the first detection period. These results indicate whether an anomaly exists and, if so, its level.
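A minimal sketch of deriving an anomaly level from risk weights follows. The rule names, the weights, the thresholds, the level names, and the choice to sum the weights are all hypothetical; the text above only states that the level depends on the risk weights of the triggered conditions.

```python
# Hypothetical risk weight per rule and minimum score per anomaly level.
RISK_WEIGHTS = {"beyond_limits": 3, "two_sigma_cluster": 2,
                "monotonic_trend": 1, "same_side_run": 1}
LEVEL_THRESHOLDS = [(4, "critical"), (2, "warning"), (1, "notice")]


def anomaly_level(rules_hit):
    """Map the set of triggered rule names to an anomaly level, or None if normal."""
    score = sum(RISK_WEIGHTS[name] for name in rules_hit)
    for minimum, level in LEVEL_THRESHOLDS:  # checked from highest to lowest
        if score >= minimum:
            return level
    return None
```

Under these assumed weights, a single point beyond the control limits scores 3 and maps to "warning", while the same point combined with a monotonic trend scores 4 and maps to "critical".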
[0067] In some possible implementations, in step S14, if the anomaly detection result indicates the presence of an anomaly, an anomaly response is executed corresponding to the anomaly level of the anomaly detection result. The higher the anomaly level, the more stringent the corresponding anomaly response, thereby reducing the resource consumption of the anomaly-related continuous integration task and mitigating its impact on code version branches.
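The graded response in step S14 can be sketched as a lookup from anomaly level to a list of actions. The level names and action names below are hypothetical placeholders; build retries and pausing branch commits are taken from the CI-specific mechanisms mentioned earlier, and making each level include the actions of the lower levels is an assumption.

```python
# Hypothetical mapping: each higher level adds a more stringent action.
RESPONSES = {
    "notice": ["notify_owner"],
    "warning": ["notify_owner", "retry_failed_builds"],
    "critical": ["notify_owner", "retry_failed_builds", "pause_branch_commits"],
}


def plan_response(level):
    """Return the ordered list of actions for an anomaly level (empty if none)."""
    return list(RESPONSES.get(level, []))
```

Keeping the mapping in data rather than in branching code makes it straightforward to tune the responses per code repository or branch.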
[0068] According to embodiments of this disclosure, task information of continuous integration tasks in the system can be obtained, and continuous integration tasks within the detection period can be divided into task subgroups according to task completion time and task indicators. The indicator values of the task subgroups can be determined according to the task information. Anomaly detection results are obtained by performing anomaly detection on the indicator values of the task subgroups according to the control chart baseline value and anomaly detection conditions within the detection period. When anomalies exist, anomaly response processing corresponding to the anomaly level is performed, thereby improving the accuracy of anomaly identification.
[0069] The anomaly detection method according to embodiments of this disclosure will now be described in detail.
[0070] As mentioned above, in step S11, based on the task information of the continuous integration task and the target task indicators to be analyzed, continuous integration tasks with the same task dimension labels and task completion time within the first detection period can be divided into at least one task subgroup.
[0071] In some possible implementations, for any task subgroup, in step S12, the indicator value for the target task indicator of the task subgroup is determined based on the task information of the continuous integration tasks within that task subgroup. The method for determining the indicator value differs for different target task indicators.
[0072] For discrete first task indicators:
[0073] In the example, the indicator value for the task build success rate is represented as:
[0074] $p_{\text{build}} = \frac{N_{\text{success}}}{N_{\text{total}}} \times 100\%$, where $N_{\text{success}}$ indicates the number of continuous integration tasks that were successfully built within the task subgroup, and $N_{\text{total}}$ indicates the total number of continuous integration tasks within the task subgroup, for example, 100.
[0075] In the example, the indicator value for the task build retry success rate is represented as:
[0076] $p_{\text{retry}} = \frac{N_{\text{retry\_success}}}{N_{\text{retry}}} \times 100\%$, where $N_{\text{retry\_success}}$ indicates the number of continuous integration tasks within the task subgroup that were built successfully after a retry, and $N_{\text{retry}}$ indicates the total number of continuous integration tasks that were retried within the task subgroup, for example, 100.
[0077] In the example, the indicator value for the task test pass rate is represented as:
[0078] $p_{\text{test}} = \frac{N_{\text{pass}}}{N_{\text{case}}} \times 100\%$, where $N_{\text{pass}}$ indicates the number of functional test cases that passed in the continuous integration tasks within the task subgroup, and $N_{\text{case}}$ indicates the total number of functional test cases of the continuous integration tasks within the task subgroup, for example, the sum of the number of functional test cases in 100 continuous integration tasks.
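[0079] For continuous second task indicators:
[0080] In the example, the indicator value for the average build time is represented as:
[0081] $\bar{\bar{X}}_{\text{build}} = \frac{1}{k}\sum_{i=1}^{k}\bar{X}_{\text{build},i}$, where $\bar{X}_{\text{build},i}$ represents the average build time of the continuous integration tasks in the i-th task subgroup, and k represents the number of task subgroups.
[0082] In the example, the indicator value for the average queuing time is represented as:
[0083] $\bar{\bar{X}}_{\text{queue}} = \frac{1}{k}\sum_{i=1}^{k}\bar{X}_{\text{queue},i}$, where $\bar{X}_{\text{queue},i}$ represents the average queuing time of the continuous integration tasks in the i-th task subgroup, and k represents the number of task subgroups.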
[0084] The baseline value of the control chart for the target task indicator in the first detection period can be a fixed baseline value, or it can be determined based on the baseline value of the control chart in the previous detection period and the indicator values of the task subgroup for the target task indicator in the first detection period, so as to realize the dynamic updating of the baseline value of the control chart and improve the accuracy of anomaly detection.
[0085] In some possible implementations, the target task indicator includes a discrete first task indicator, the first control chart of the first task indicator is a p-control chart, and the first task indicator corresponds to a first task subgroup. The anomaly detection method according to embodiments of this disclosure further includes:
[0086] Based on the first indicator value of the first task indicator for the first task subgroup within the first detection period, a first mean value of the first task indicator is determined; based on the first mean value, a control chart baseline value of the first control chart is determined, wherein the control chart baseline value of the first control chart includes at least one of a first centerline, a first upper control limit, and a first lower control limit.
[0087] For example, for a discrete first task indicator, there may be multiple first task subgroups within the first detection period. The first mean of the first task indicator can be determined based on the first indicator value of each first task subgroup for the first task indicator; and the first mean is determined as the first center line in the control chart baseline value; then the first upper control limit and the first lower control limit are determined based on the first center line.
[0088] In the example, the first center line $CL_p$ is represented as: $CL_p = \bar{p} = \frac{\sum_{i=1}^{k} n_i p_i}{\sum_{i=1}^{k} n_i}$, where $p_i$ represents the first indicator value of the i-th first task subgroup within the first detection period, $n_i$ represents the number of tasks in the i-th first task subgroup within the first detection period, and k represents the number of first task subgroups within the first detection period.
[0089] In this example, the dynamic standard deviation σ of the p-control chart is expressed as: $\sigma = \sqrt{\frac{\bar{p}(1-\bar{p})}{n}}$, where n represents the number of tasks in the first task subgroup, for example, 100.
[0090] Accordingly, the first upper control limit $UCL_p$ is represented as: $UCL_p = \bar{p} + 3\sigma = \bar{p} + 3\sqrt{\frac{\bar{p}(1-\bar{p})}{n}}$. The first lower control limit $LCL_p$ is represented as: $LCL_p = \max(0,\ \bar{p} - 3\sigma) = \max\left(0,\ \bar{p} - 3\sqrt{\frac{\bar{p}(1-\bar{p})}{n}}\right)$, which means that the first lower control limit takes 0 when $\bar{p} - 3\sigma$ is negative.
[0091] This method enables dynamic updating of control chart baseline values, improving the accuracy of anomaly detection.
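The p-control chart baseline described above can be sketched as follows. The function name and argument layout are assumptions for illustration, and the rates are expressed as proportions in [0, 1] rather than percentages so that the binomial standard deviation formula applies directly.

```python
import math
from typing import List, Tuple


def p_chart_baseline(rates: List[float], sizes: List[int], n: int) -> Tuple[float, float, float]:
    """Return (center line, upper control limit, lower control limit).

    rates: per-subgroup proportions in [0, 1] (e.g. build success rate)
    sizes: number of tasks in each subgroup
    n:     subgroup size used for the standard deviation
    """
    if len(rates) != len(sizes) or not rates:
        raise ValueError("rates and sizes must be non-empty and of equal length")
    # Center line: task-count-weighted mean of the subgroup proportions.
    center = sum(p * m for p, m in zip(rates, sizes)) / sum(sizes)
    # Standard deviation of a binomial proportion.
    sigma = math.sqrt(center * (1.0 - center) / n)
    ucl = center + 3.0 * sigma
    lcl = max(0.0, center - 3.0 * sigma)  # a proportion cannot be negative
    return center, ucl, lcl


center, ucl, lcl = p_chart_baseline([0.90, 0.92, 0.88], [100, 100, 100], n=100)
```

With these illustrative inputs the center line is 0.90 and σ is 0.03, so the limits are roughly 0.99 and 0.81.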
[0092] In some possible implementations, the target task indicator includes a continuous second task indicator, the second control chart of the second task indicator is an XR control chart, and the second task indicator corresponds to a second task subgroup. The anomaly detection method according to embodiments of this disclosure further includes:
[0093] Based on the second indicator value of the second task subgroup for the second task indicator within the first detection period, determine the second mean of the subgroup range of the second task subgroup and the third mean of the second task indicator; based on the second mean and the third mean, determine the control chart baseline value of the second control chart, wherein the control chart baseline value of the second control chart includes at least one of the second center line, the second upper control limit and the second lower control limit of the range chart, and at least one of the third center line, the third upper control limit and the third lower control limit of the mean chart.
[0094] For example, for a continuous second task indicator, there may be multiple second task subgroups within the first detection period. For the range chart (R chart) and the mean chart ($\bar{X}$ chart) in the XR control chart, the second mean of the subgroup range and the third mean of the second task indicator can be determined based on the second indicator value of each second task subgroup for the second task indicator.
[0095] The subgroup range of a second task subgroup can be the difference between the maximum and minimum indicator values of the continuous integration tasks within that second task subgroup; the second mean can be the average of the subgroup ranges of the multiple second task subgroups.
[0096] In some possible implementations, the second mean can be determined as the second center line of the range chart, and the third mean can be determined as the third center line of the mean chart; then the second upper control limit and the second lower control limit can be determined based on the second center line, and the third upper control limit and the third lower control limit can be determined based on the third center line and the second mean.
[0097] In the example, the second center line $CL_R$ is represented as: $CL_R = \bar{R} = \frac{1}{k}\sum_{i=1}^{k} R_i$, where $\bar{R}$ is the second mean of the subgroup range, $R_i$ represents the subgroup range of the i-th second task subgroup within the first detection period, and k represents the number of second task subgroups within the first detection period.
[0098] In the example, the second upper control limit $UCL_R$ is represented as: $UCL_R = D_4 \times \bar{R}$, and the second lower control limit $LCL_R$ is represented as: $LCL_R = D_3 \times \bar{R}$, where $D_3$ represents the lower control limit coefficient of the R chart and $D_4$ represents the upper control limit coefficient of the R chart; $D_3$ and $D_4$ depend on the number of tasks n in the second task subgroup.
[0099] In the example, the third center line $CL_X$ is represented as: $CL_X = \bar{\bar{X}} = \frac{1}{k}\sum_{i=1}^{k} \bar{X}_i$, where $\bar{X}_i$ represents the indicator value of the i-th second task subgroup within the first detection period.
[0100] In the example, the third upper control limit $UCL_X$ is represented as: $UCL_X = \bar{\bar{X}} + A_2 \times \bar{R}$, and the third lower control limit $LCL_X$ is represented as: $LCL_X = \bar{\bar{X}} - A_2 \times \bar{R}$, where $A_2$ represents the control limit coefficient of the $\bar{X}$ chart, which depends on the number of tasks n in the second task subgroup.
[0101] In the example, when the number of tasks in the second task subgroup is n=3, A2=1.023; d2=1.693; D3=0; D4=2.574.
[0102] In the example, when the number of tasks in the second task subgroup is n=4, A2=0.729; d2=2.059; D3=0; D4=2.282.
[0103] In the example, when the number of tasks in the second task subgroup is n=5, A2=0.577; d2=2.326; D3=0; D4=2.114.
[0104] In the example, when the number of tasks in the second task subgroup is n=6, A2=0.483; d2=2.534; D3=0; D4=2.004.
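[0105] Here, d2 is the constant that converts the mean subgroup range into an estimate of the process standard deviation, σ ≈ R̄/d2. The coefficient A2 already incorporates it: A2 = 3/(d2·√n), so that center line ± A2 × R̄ is equivalent to center line ± 3σ/√n.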
[0106] Both the p-control chart and the XR control chart follow the "3σ principle" (control limit = center line ± 3σ), and their mathematical logic is the same. The only difference is that the σ of the p-control chart is based on the standard deviation of the binomial distribution proportion, while the σ of the XR control chart is based on the process standard deviation of the range estimate (simplified by using the control limit coefficient). It should be understood that those skilled in the art can set specific calculation methods and values for each parameter according to actual circumstances, and this disclosure does not impose any restrictions on this.
[0107] In this way, the baseline value of the control chart can be automatically determined according to the type of task indicator, so that the abnormality can be determined based on the baseline value, thereby improving the accuracy of abnormality detection.
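As an illustration of the XR control chart baseline, the sketch below computes the center lines and control limits of both charts from raw subgroup values, using the coefficients listed above for n = 3 to 6. The function name and the returned dictionary layout are assumptions for illustration.

```python
from typing import Dict, List, Tuple

# (A2, D3, D4) per subgroup size n, as listed in the example values above.
XBAR_R_CONSTANTS: Dict[int, Tuple[float, float, float]] = {
    3: (1.023, 0.0, 2.574),
    4: (0.729, 0.0, 2.282),
    5: (0.577, 0.0, 2.114),
    6: (0.483, 0.0, 2.004),
}


def xbar_r_baseline(subgroups: List[List[float]]) -> Dict[str, Tuple[float, float, float]]:
    """Return {"x": (CL, UCL, LCL), "r": (CL, UCL, LCL)} for equally sized subgroups."""
    n = len(subgroups[0])
    if any(len(g) != n for g in subgroups):
        raise ValueError("all subgroups must have the same size")
    a2, d3, d4 = XBAR_R_CONSTANTS[n]

    means = [sum(g) / n for g in subgroups]        # subgroup means
    ranges = [max(g) - min(g) for g in subgroups]  # subgroup ranges (max - min)
    x_cl = sum(means) / len(means)                 # third center line
    r_cl = sum(ranges) / len(ranges)               # second center line

    return {
        "x": (x_cl, x_cl + a2 * r_cl, x_cl - a2 * r_cl),
        "r": (r_cl, d4 * r_cl, d3 * r_cl),
    }


# Build times in minutes for three subgroups of three tasks each (illustrative data).
limits = xbar_r_baseline([[44, 45, 46], [43, 45, 47], [45, 45, 45]])
```

For this illustrative data the mean chart is centered at 45 minutes with limits of about 47.05 and 42.95, and the range chart is centered at 2 with an upper limit of about 5.15.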
[0108] In some possible implementations, to address build time offsets caused by code iterations, such as the extension of build time due to the addition of new operators, the current control chart baseline value for the first detection period can be dynamically updated based on the control chart baseline value of the historical detection period. This allows the control chart baseline to adapt to the fluctuations in the indicator value while maintaining a certain stability of the control chart baseline and reducing abnormal fluctuations in the baseline value.
[0109] In some possible implementations, the anomaly detection method according to embodiments of this disclosure further includes:
[0110] Based on the indicator values of the at least one task subgroup for the target task indicator, the initial control chart baseline value of the first detection period is determined; based on the control chart baseline value of the second detection period, the initial control chart baseline value of the first detection period, and a preset forgetting factor, the control chart baseline value of the first detection period is determined, wherein the second detection period is the previous detection period of the first detection period.
[0111] For example, for the current first detection period, the method described above can be used to determine the initial control chart baseline value for the first detection period based on the indicator values of at least one task subgroup for the target task indicator within the first detection period. Then, the control chart baseline value of the previous detection period (hereinafter referred to as the second detection period) and a pre-set forgetting factor can be obtained to determine the control chart baseline value for the first detection period.
[0112] The forgetting factor λ represents the weighting coefficient of the control chart baseline value from the previous detection period, used to control the influence of historical data on the current centerline, where 0 < λ < 1. The closer λ is to 1, the higher the weighting of the control chart baseline value from the previous detection period, the smaller the influence of the initial control chart baseline value in the current first detection period, and the more stable the control chart baseline value. The closer λ is to 0, the greater the influence of the initial control chart baseline value in the current first detection period, the more sensitive the baseline, but the more prone it is to fluctuation.
[0113] In Continuous Integration (CI) processes, the process typically exhibits characteristics of "high frequency but gradual change." Code commits are frequent, but changes in build time and success rate are generally gradual. For example, adding a new operator might increase build time by 2-3 minutes per month, rather than a sudden jump from 45 minutes to 60 minutes. In this case, to balance "historical stability" and "current adaptability," the stability of the baseline should be prioritized in CI scenarios. Therefore, the forgetting factor λ can be set relatively high, for example, 0.7-0.9. It should be understood that those skilled in the art can set the forgetting factor λ according to the actual situation, and this disclosure does not impose any restrictions on it.
[0114] In some possible implementations, the center line in the control chart baseline value of the first detection period can be represented as:
[0115] $CL_t = \lambda\, CL_{t-1} + (1-\lambda)\, CL_t^{(0)}$ Formula 1.
[0116] In Formula 1, $CL_t$ represents the center line of the first detection period, which is used to determine whether the CI data of the current detection period is abnormal; t indicates that the first detection period is the t-th detection period. $CL_{t-1}$ represents the center line of the second detection period, that is, the center line of the (t-1)-th detection period. It is used to inherit the stability of the historical benchmark and avoid drastic changes in the benchmark due to fluctuations in the current data.
[0117] In Formula 1, $CL_t^{(0)}$ represents the initial center line of the first detection period, corresponding to $\bar{p}$, $\bar{R}$, or $\bar{\bar{X}}$ above. It feeds back the actual status of the current CI process, enabling the baseline to track process changes, such as mean shifts caused by code iterations. Furthermore, because continuous integration tasks are divided into task subgroups, subsequent processing can use the subgroup mean rather than a single data point as input. This smooths out random fluctuations, reflects the true trend of the current period, prevents a single abnormal data point from distorting the baseline, improves the accuracy of anomaly identification, and reduces the probability of false alarms.
[0118] In the example, in the CI scenario of a continuous integration system, code iteration (such as adding features to the operator library or compiler optimization) and changes in cluster state can cause the CI metric baseline to shift (e.g., the average build time increases from 45 minutes to 48 minutes). If a fixed centerline is used (e.g., always using 45 minutes as the baseline), it will lead to a large number of "false positives" (actually normal 48-minute builds are judged as abnormal) or "false negatives" (truly abnormal 44-minute builds are not identified because the baseline is invalid).
[0119] The core function of the aforementioned centerline update is to enable the CI indicator benchmark (centerline) to adaptively track process changes through dynamic calculation of "historical benchmark weighting + current data feedback", maintain the accuracy of anomaly detection, and avoid frequent manual adjustments to the benchmark.
[0120] In some possible implementations, the process standard deviation σ of the control chart can be updated in a similar manner, expressed as:
[0121] $\sigma_t = \sqrt{\lambda\, \sigma_{t-1}^{2} + (1-\lambda)\left(CL_t^{(0)} - CL_t\right)^{2}}$ Formula 2.
[0122] In Formula 2, $\sigma_t$ represents the dynamic standard deviation of the first detection period, used to calculate the control limits for the first detection period. $\sigma_{t-1}$ represents the dynamic standard deviation of the second detection period, that is, of the previous detection period. $CL_t^{(0)} - CL_t$ is the deviation between the initial center line of the first detection period and the center line obtained by Formula 1, and reflects how far the initial center line of the current period deviates from the benchmark. λ represents the forgetting factor, which is the weight of the historical standard deviation and is the same λ as in Formula 1 to keep the benchmark consistent.
[0123] In some possible implementations, after the dynamic standard deviation $\sigma_t$ of the first detection period is obtained, the control limits can be updated based on $\sigma_t$ and the center line $CL_t$, represented as:
[0124] Upper control limit: $UCL_t = CL_t + 3\sigma_t$ Formula 3.
[0125] Lower control limit: $LCL_t = \max(0,\ CL_t - 3\sigma_t)$ Formula 4.
[0126] In this way, all control chart baseline values for the first detection cycle can be obtained.
[0127] The processing method of dynamically updating the baseline periodically can avoid baseline lag and has better adaptability to scenarios with high-frequency code iteration. Moreover, it can track the data fluctuation trend and has better adaptability to the changes in the fluctuation of task construction time, thereby improving the accuracy of anomaly recognition in continuous integration tasks.
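The periodic baseline update can be sketched as below. The center line and control limits follow the forgetting-factor weighting and the 3σ limits described above. The standard deviation update is written here as an exponentially weighted variance of the deviation between the initial center line and the updated center line, which is an assumed form chosen for illustration; the function name and the default λ = 0.8 are also assumptions.

```python
import math
from typing import Tuple


def update_baseline(prev_center: float, prev_sigma: float, init_center: float,
                    lam: float = 0.8) -> Tuple[float, float, float, float]:
    """Return (center, sigma, ucl, lcl) for the current detection period.

    prev_center, prev_sigma: baseline of the previous detection period
    init_center:             initial center line computed from the current period only
    lam:                     forgetting factor, 0 < lam < 1 (weight of history)
    """
    if not 0.0 < lam < 1.0:
        raise ValueError("forgetting factor must satisfy 0 < lam < 1")
    # Weighted blend of historical and current center lines.
    center = lam * prev_center + (1.0 - lam) * init_center
    # Assumed form: exponentially weighted variance of the current deviation.
    sigma = math.sqrt(lam * prev_sigma ** 2 + (1.0 - lam) * (init_center - center) ** 2)
    ucl = center + 3.0 * sigma
    lcl = max(0.0, center - 3.0 * sigma)
    return center, sigma, ucl, lcl


# Average build time drifts from 45 to 48 minutes after a code iteration.
center, sigma, ucl, lcl = update_baseline(prev_center=45.0, prev_sigma=5.0, init_center=48.0)
```

With λ = 0.8 the center line moves only from 45 to 45.6 minutes in one period, which shows how a high forgetting factor favors stability over responsiveness.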
[0128] To implement anomaly detection for continuous integration tasks, one or more anomaly detection conditions need to be preset to detect whether the metric values of task subgroups trigger these anomaly detection conditions.
[0129] In some possible implementation manners, the anomaly detection conditions include at least one of a first anomaly condition, a second anomaly condition, a third anomaly condition, and a fourth anomaly condition.
[0130] Among them, the first anomaly condition indicates that the metric value of a task subgroup exceeds the control limit range of the control chart baseline value of the target task metric; the second anomaly condition indicates that the metric values of M1 consecutive task subgroups are on the same side of the center line of the control chart baseline value of the target task metric; the third anomaly condition indicates that the metric values of M2 consecutive task subgroups are monotonically increasing or decreasing; the fourth anomaly condition indicates that among the metric values of M3 consecutive task subgroups, M4 metric values exceed the range of twice the standard deviation on the same side of the center line, where M1, M2, M3, and M4 are integers greater than 1, and M4 < M3.
[0131] For example, the first anomaly condition can also be called a point out of bounds, which is used to indicate that the metric value of a task subgroup exceeds the control limit range, that is, exceeds the upper control limit or the lower control limit, and is expressed as:
[0132] $C_1 = I(x_t > UCL \lor x_t < LCL)$ Formula 5.
[0133] In Formula 5, $C_1$ represents the value of the first anomaly condition; I( ) represents an indicator function: if the condition in the parentheses holds, I( ) = 1 (anomaly); otherwise, I( ) = 0 (normal). $x_t$ represents the metric value of the currently analyzed task subgroup; ∨ represents logical OR, that is, either of the two conditions is satisfied; UCL represents the upper control limit; LCL represents the lower control limit; the control limit range is the range between the upper control limit and the lower control limit.
[0134] In some possible implementation manners, a risk weight can be set for each anomaly detection condition to represent its importance. For the first anomaly condition, the risk weight $w_1$ can be set relatively high, for example $w_1 = 4$. The present disclosure does not restrict the specific value of the risk weight $w_1$.
[0135] In some possible implementation manners, when the first anomaly condition is triggered, a sudden anomaly has been detected, which indicates that a special-cause variation has occurred in the continuous integration task build process and immediate investigation is required. The condition checks whether the metric value of the currently analyzed task subgroup exceeds the control limit range. The control limits define the steady-state interval (containing 99.73% of normal data) calculated from historical data, so exceeding them represents a sudden anomaly. For example, the task build duration = 80 min > UCL = 65 min, or the task build success rate = 85% < LCL = 90%.
[0136] In some possible implementation manners, the second anomaly condition is the condition for offset anomaly detection, indicating that the metric values of M1 consecutive task subgroups are on the same side of the center line, which is expressed as:
[0137] $C_2 = I\left(\forall i \in [t-M_1+1,\ t]:\ x_i > CL \ \lor\ \forall i \in [t-M_1+1,\ t]:\ x_i < CL\right)$ Formula 6.
[0138] In Formula 6, $C_2$ represents the value of the second anomaly condition; t represents the currently analyzed task subgroup; $[t-M_1+1,\ t]$ denotes the M1 task subgroups from the (M1 - 1)-th task subgroup before the t-th task subgroup up to the t-th task subgroup; $x_i$ represents the metric value of the i-th task subgroup among the M1 task subgroups; CL represents the center line. M1 is, for example, 7, and this disclosure does not restrict the specific value of M1.
[0139] In some possible implementation manners, the risk weight $w_2$ of the second anomaly condition can be set to medium, for example $w_2 = 2$. This disclosure does not restrict the specific value of the risk weight $w_2$.
[0140] In some possible implementation manners, when the second anomaly condition is triggered, an offset anomaly has been detected, which indicates that the data mean in the continuous integration task build process may have shifted. The condition checks whether all M1 consecutive CI metric data points fall on the same side of the center line. If they do, the CI process benchmark has shifted, for example because the mean build time increased after a code iteration but the benchmark was not updated. For example, the task build duration is above CL = 45 min for 7 consecutive points (heavy GPU node load), or the task test pass rate is below CL = 98% for 7 consecutive points (test case update).
[0141] In some possible implementation manners, the third abnormal condition is the condition for trend abnormality detection, indicating that the metric values of M2 consecutive task subgroups are monotonically increasing or monotonically decreasing, which is expressed as:
[0142] $C_3 = I\left(\forall i \in [t-M_2+2,\ t]:\ x_i > x_{i-1} \ \lor\ \forall i \in [t-M_2+2,\ t]:\ x_i < x_{i-1}\right)$ Formula 7.
[0143] In Formula 7, $C_3$ represents the value of the third anomaly condition; t represents the currently analyzed task subgroup; the comparisons cover the M2 task subgroups from the (M2 - 1)-th task subgroup before the t-th task subgroup up to the t-th task subgroup; $x_i$ represents the metric value of the i-th task subgroup among the M2 task subgroups; $x_{i-1}$ represents the metric value of the (i-1)-th task subgroup among the M2 task subgroups. For example, M2 can be 7, and this disclosure does not restrict the specific value of M2.
[0144] In some possible implementations, the risk weight $w_3$ of the third anomaly condition can be set to medium, for example $w_3 = 2$. This disclosure does not restrict the specific value of the risk weight $w_3$.
[0145] In some possible implementations, when the third anomaly condition is triggered, a trend anomaly has been detected. The condition checks whether M2 consecutive CI metric data points are monotonically increasing or monotonically decreasing. A monotonic trend indicates a gradual anomaly in the data during the continuous integration task build process, such as the accumulation of code defects or a gradual increase in cluster load. For example, the task build time increases from 40 minutes to 55 minutes over 7 consecutive points, or the task build success rate decreases over 7 consecutive points.
[0146] In some possible implementations, the fourth anomaly condition is the significant fluctuation anomaly condition, indicating that among M3 consecutive task subgroups, M4 index values exceed twice the standard deviation range on the same side of the center line, expressed as:
[0147] $C_4 = I\left(\sum_{i=t-M_3+1}^{t} I(x_i > CL + 2\sigma) \ge M_4 \ \lor\ \sum_{i=t-M_3+1}^{t} I(x_i < CL - 2\sigma) \ge M_4\right)$ Formula 8.
[0148] In Formula 8, $C_4$ represents the value of the fourth anomaly condition; t represents the currently analyzed task subgroup; the sums run over the M3 task subgroups from the (M3 - 1)-th task subgroup before the t-th task subgroup up to the t-th task subgroup; $x_i$ represents the metric value of the i-th task subgroup among the M3 task subgroups; CL represents the center line and σ the standard deviation. For example, M3 may be 3 and M4 may be 2. This disclosure does not restrict the specific values of M3 and M4.
[0149] In some possible implementations, the risk weight $w_4$ of the fourth anomaly condition can be set to medium, for example $w_4 = 3$. This disclosure does not restrict the specific value of the risk weight $w_4$.
[0150] In some possible implementations, when the fourth anomaly condition is triggered, a significant fluctuation anomaly has been detected. The condition checks whether at least two out of three consecutive CI metric data points fall outside the "center line ± 2σ" range on the same side of the center line, which signifies significant fluctuation in the CI process. Here, "center line ± 2σ" is the warning interval of the control chart (containing 95.45% of normal data). If two out of three consecutive points exceed the warning interval, the fluctuation has exceeded the normal range and requires attention.
[0151] In this way, multiple anomaly risks can be detected based on multiple anomaly detection conditions, improving the comprehensiveness and accuracy of anomaly identification.
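The four anomaly detection conditions can be sketched as small predicates over the most recent subgroup metric values. The function names are assumptions for illustration, and returning 0 when fewer points than the window size are available is a design choice of this sketch rather than something stated in the disclosure.

```python
from typing import Sequence


def rule_out_of_limits(x: Sequence[float], ucl: float, lcl: float) -> int:
    """Condition 1: the latest point lies outside the control limits."""
    return int(x[-1] > ucl or x[-1] < lcl)


def rule_same_side(x: Sequence[float], cl: float, m1: int = 7) -> int:
    """Condition 2: the last m1 points are all on one side of the center line."""
    window = x[-m1:]
    if len(window) < m1:
        return 0
    return int(all(v > cl for v in window) or all(v < cl for v in window))


def rule_trend(x: Sequence[float], m2: int = 7) -> int:
    """Condition 3: the last m2 points are strictly increasing or strictly decreasing."""
    window = x[-m2:]
    if len(window) < m2:
        return 0
    pairs = list(zip(window, window[1:]))
    return int(all(b > a for a, b in pairs) or all(b < a for a, b in pairs))


def rule_two_sigma(x: Sequence[float], cl: float, sigma: float,
                   m3: int = 3, m4: int = 2) -> int:
    """Condition 4: at least m4 of the last m3 points exceed 2 sigma on the same side."""
    window = x[-m3:]
    if len(window) < m3:
        return 0
    above = sum(v > cl + 2.0 * sigma for v in window)
    below = sum(v < cl - 2.0 * sigma for v in window)
    return int(above >= m4 or below >= m4)


build_minutes = [40, 42, 45, 47, 50, 52, 55]
triggers = (
    rule_out_of_limits(build_minutes, ucl=65, lcl=25),
    rule_same_side(build_minutes, cl=45),
    rule_trend(build_minutes),
    rule_two_sigma(build_minutes, cl=45, sigma=5),
)
```

For this illustrative series only the trend condition fires, since the values rise steadily from 40 to 55 minutes without leaving the control limits.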
[0152] In some possible implementations, in step S13, based on the control chart baseline value of the target task indicator within the first detection period and the preset anomaly detection conditions, anomaly detection is performed on the indicator value of the task subgroup to obtain the anomaly detection result of the first detection period; the anomaly detection result is used to indicate whether an anomaly exists and the anomaly level when an anomaly exists.
[0153] In some possible implementations, the anomaly detection result includes the anomaly level, and step S13 may include: determining the anomaly detection conditions triggered by the task subgroup within the first detection period for the target task indicator; determining the anomaly score of the target task indicator based on the anomaly detection conditions triggered by the task subgroup and the risk weight of each anomaly detection condition; if there are multiple target task indicators, determining the overall anomaly score of the first detection period based on the indicator weight and anomaly score of each of the multiple target task indicators; and determining the anomaly level of the first detection period based on the overall anomaly score, wherein the anomaly detection result includes the anomaly level.
[0154] For example, for any target task indicator, the above judgment method can be used to determine the anomaly detection conditions triggered by the task subgroup within the first detection period. Then, based on the anomaly detection conditions triggered by the task subgroup and the risk weight of each anomaly detection condition, the anomaly score of the target task indicator can be determined. For example, if the first anomaly condition and the fourth anomaly condition are triggered within the first detection period, their values are 1; the second anomaly condition and the third anomaly condition are not triggered, so their values are 0. The risk weights of the first anomaly condition and the fourth anomaly condition are 4 and 3, respectively. Then, the anomaly score of the target task indicator is 4 + 3 = 7.
[0155] In some possible implementation manners, if there are multiple target task metrics, the overall anomaly score may be determined according to the metric weight and anomaly score of each target task metric. A more important target task metric has a higher metric weight. For example, the task build success rate is a core task metric that affects the effectiveness of the CI process, and its metric weight is 1.5; the task build duration is a secondary task metric that affects the efficiency of continuous integration task builds, and its metric weight is 1. The anomaly scores may be weighted according to the metric weight of each target task metric, and the weighted sum of the anomaly scores is used as the overall anomaly score of the system.
[0156] In some possible implementation manners, according to the anomaly score range in which the overall anomaly score S falls, the anomaly level of the first detection period is determined, and the anomaly detection result includes the anomaly level.
[0157] In some possible implementation manners, the anomaly levels include at least two of the first level, the second level, and the third level, in increasing order of severity: the second level is higher than the first level, and the third level is higher than the second level. For example, the anomaly score range 0 < S ≤ 5 corresponds to the first level, indicating a minor anomaly; the anomaly score range 5 < S ≤ 12 corresponds to the second level, indicating a moderate anomaly; the anomaly score range S > 12 corresponds to the third level, indicating a severe anomaly.
[0158] It should be understood that those skilled in the art can set the corresponding relationship between the anomaly score range and the anomaly level according to the number of target task metrics and the actual situation of the system, and the present disclosure does not limit this.
[0159] In the example, two CI metrics, "task build success rate" and "average build time", are used to illustrate the CI anomaly response mechanism. If more CI metrics are monitored, the corresponding anomaly score ranges need to be adjusted according to the actual situation. Metric weights: build success rate weight $W_1$ (core, affecting effectiveness) = 1.5; build time weight $W_2$ (secondary, affecting efficiency) = 1. Trigger flags: $C_{j,k}$ (k = 1, 2, 3, 4; the value is 1 if metric j triggers the k-th anomaly condition and 0 otherwise). Total anomaly score formula: $S = \sum_{j} W_j \sum_{k=1}^{4} w_k C_{j,k}$, where $w_k$ is the risk weight of the k-th anomaly condition, here $w_1 = 4$, $w_2 = 2$, $w_3 = 2$, $w_4 = 3$ as in the examples above.
[0160] Example 1: The build success rate triggers no anomaly condition, and the build time triggers only anomaly condition 2, so S = 1.5 × 0 + 1 × 2 = 2. Anomaly level determination: 0 < 2 ≤ 5, which corresponds to the first level, a minor anomaly, and the corresponding response action is performed.
[0161] Example 2: The build success rate triggers anomaly conditions 1 and 4, and the build time triggers anomaly condition 1, so S = 1.5 × (4 + 3) + 1 × 4 = 14.5. Anomaly level determination: 14.5 > 12, which corresponds to the third level, a severe anomaly, and the corresponding response action is performed.
[0162] In this way, an anomaly score can be determined based on the triggered anomaly detection conditions, thereby determining the anomaly level so that appropriate processing can be carried out subsequently, improving the accuracy and targeting of anomaly identification.
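The scoring and grading steps can be sketched as follows, using the example risk weights (4, 2, 2, 3), the example metric weights (1.5 and 1), and the example score ranges given above. The function names and the metric keys `build_success_rate` and `build_time` are hypothetical, and level 0 is used here to denote that no anomaly was found.

```python
from typing import Dict, Sequence

# Example risk weights of anomaly conditions 1 to 4.
RISK_WEIGHTS = (4, 2, 2, 3)


def metric_score(triggers: Sequence[int]) -> int:
    """Anomaly score of one metric: sum of risk weights of the triggered conditions."""
    return sum(w * c for w, c in zip(RISK_WEIGHTS, triggers))


def overall_score(triggers_by_metric: Dict[str, Sequence[int]],
                  metric_weights: Dict[str, float]) -> float:
    """Overall score: metric-weighted sum of the per-metric anomaly scores."""
    return sum(metric_weights[name] * metric_score(flags)
               for name, flags in triggers_by_metric.items())


def anomaly_level(score: float) -> int:
    """Map the overall score to a level: 0 none, 1 minor, 2 moderate, 3 severe."""
    if score <= 0:
        return 0
    if score <= 5:
        return 1
    if score <= 12:
        return 2
    return 3


weights = {"build_success_rate": 1.5, "build_time": 1.0}

# Example 1: only the build time triggers condition 2.
s1 = overall_score({"build_success_rate": (0, 0, 0, 0), "build_time": (0, 1, 0, 0)}, weights)
# Example 2: success rate triggers conditions 1 and 4, build time triggers condition 1.
s2 = overall_score({"build_success_rate": (1, 0, 0, 1), "build_time": (1, 0, 0, 0)}, weights)
```

Here `s1` evaluates to 2.0 (first level) and `s2` to 14.5 (third level), matching the two worked examples.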
[0163] In some possible implementations, the task dimension label includes at least one of the following: the code repository identifier, the code version branch identifier, and the hardware resource type identifier for the continuous integration task.
[0164] In some possible implementations, the anomaly response processing corresponding to the first level includes: marking the abnormal task subgroup and the task dimension label of the abnormal task in the abnormal task subgroup, and sending a first alarm notification to the first management device.
[0165] The anomaly response processing corresponding to the second level includes: retrying the build of the abnormal task in the abnormal task subgroup, updating the display page of the target task indicator, and sending a second alarm notification to the first device of the person who submitted the abnormal task and to the first management device.
[0166] The anomaly response processing corresponding to the third level includes at least one of the following: stopping task submission for the code version branch corresponding to the abnormal task in the abnormal task subgroup, restarting the build node of the abnormal task in the continuous integration system, rolling back the code of the code version branch to a stable version, and sending a third alarm notification to the second device of the superior of the person who submitted the abnormal task and to the management device one level above the first management device.
[0167] For example, under the first level of minor anomalies, the continuous integration task build process in the system can be maintained without interruption. The corresponding anomaly response handling includes: marking the task subgroup that triggered the anomaly detection condition as an abnormal task subgroup, treating the continuous integration tasks in the abnormal task subgroup as abnormal tasks to be reviewed, and recording the task dimension labels of the abnormal tasks in the abnormal task subgroup, including code repository identifier, code version branch identifier, and hardware resource type identifier.
[0168] In some possible implementations, the notification method under the first level is as follows: send a first alarm notification to the first management device to inform the corresponding personnel to handle it. The first management device can be the device corresponding to the CI system administrator. The first alarm notification can use a non-real-time notification method, such as email notification, and the processing time limit for the first alarm notification can be set to a relatively long time range, such as within 24 hours.
[0169] In some possible implementations, even under the second level of moderate anomalies, the continuous integration task build process in the system can be maintained without interruption. The corresponding anomaly response handling includes: retrying the build of the abnormal task subgroup multiple times, for example, up to 3 times, and prioritizing the allocation of idle system resources, such as idle GPU nodes, to the retrying tasks; and updating the display page of target task metrics in real time, such as the task build success rate and task build time in the CI monitoring dashboard.
[0170] In some possible implementations, the notification method under the second level is as follows: a second alarm notification is sent to the first device of the person who submitted the abnormal task and the first management device. The second alarm notification can be immediate, for example, by notifying the R&D personnel (first device) and CI system administrators (first management device) who submitted the abnormal task via instant messaging software. The processing time limit for the second alarm notification can be set to a short time range, such as within 2 hours.
[0171] In some possible implementations, at the third level (a severe anomaly), the submission of continuous integration tasks for the corresponding code version branch (e.g., the main branch) can be suspended based on the code version branch identifier of the abnormal task in the abnormal task subgroup; the resources (such as video memory) held on the build node that built the abnormal task in the continuous integration system can be forcibly released, and the build node can be restarted; and the code of that code version branch can be rolled back to a historical stable version. After subsequent repairs, several consecutive builds (e.g., 3) without critical warnings are required before the submission privileges for continuous integration tasks on the corresponding code version branch can be restored.
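The restore condition described above (a number of consecutive builds without critical warnings before submissions are re-enabled) can be illustrated with a minimal Python sketch. The class name `BranchGate` and its methods are hypothetical and are not part of the disclosure; the default of 3 clean builds follows the example value in the text.

```python
class BranchGate:
    """Tracks whether task submission to a code version branch is allowed (hypothetical helper)."""

    def __init__(self, required_clean_builds: int = 3) -> None:
        self.required_clean_builds = required_clean_builds
        self.locked = False
        self.clean_streak = 0

    def lock(self) -> None:
        """Suspend submissions after a third-level anomaly."""
        self.locked = True
        self.clean_streak = 0

    def record_build(self, critical_warning: bool) -> bool:
        """Record one verification build; return True if submissions are allowed afterwards."""
        if not self.locked:
            return True
        if critical_warning:
            # Any critical warning resets the streak of clean builds.
            self.clean_streak = 0
        else:
            self.clean_streak += 1
        if self.clean_streak >= self.required_clean_builds:
            self.locked = False
        return not self.locked
```

Resetting the streak on any critical warning is one possible reading of "consecutive"; an implementation could equally count builds within a time window.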
[0172] In some possible implementations, the notification method under the third level is as follows: a third alarm notification is sent to the second device of the superior of the person who submitted the abnormal task and to the management device superior to the first management device. The third alarm notification can be immediate, for example, using multiple instant messaging applications to notify the submitter's superior (e.g., the R&D team leader, corresponding to the second device) and the CI system administrator's superior (e.g., the operations supervisor, corresponding to the superior management device). The processing time limit for the third alarm notification can be set to an even shorter time range, such as within 30 minutes.
[0173] It should be understood that those skilled in the art can set the abnormal response handling methods for each abnormality level according to the actual situation, and this disclosure does not impose any restrictions on this.
[0174] In this way, a graded response to anomalies can be achieved based on the severity level of the continuous integration system anomalies. Minor and moderate anomalies do not affect the overall operation of the system, thus improving the overall efficiency of the system. In the case of severe anomalies, the processing of the corresponding branch is suspended and relevant personnel are alerted to handle the situation in a timely manner, thereby avoiding the accumulation of failed task builds and reducing the overall anomaly risk of the system.
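As a rough illustration of the graded response, the three levels can be expressed as a lookup table from anomaly level to actions, recipients, channel, and handling deadline. This is only a sketch: the action and recipient strings are illustrative labels chosen here, while the deadlines (24 hours, 2 hours, 30 minutes) are the example values given above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResponsePolicy:
    actions: tuple[str, ...]
    recipients: tuple[str, ...]
    channel: str
    deadline_hours: float


# Hypothetical policy table; the names are illustrative labels, not system commands.
POLICIES = {
    1: ResponsePolicy(
        actions=("mark_abnormal_subgroup", "record_dimension_labels"),
        recipients=("ci_admin",),
        channel="email",
        deadline_hours=24.0,
    ),
    2: ResponsePolicy(
        actions=("retry_build", "update_dashboard"),
        recipients=("submitter", "ci_admin"),
        channel="instant_message",
        deadline_hours=2.0,
    ),
    3: ResponsePolicy(
        actions=("pause_branch_submission", "restart_build_node", "rollback_branch"),
        recipients=("submitter_superior", "ci_admin_superior"),
        channel="instant_message",
        deadline_hours=0.5,
    ),
}


def plan_response(level: int) -> ResponsePolicy:
    """Return the response policy for an anomaly level (1 = minor, 3 = severe)."""
    if level not in POLICIES:
        raise ValueError(f"unknown anomaly level: {level}")
    return POLICIES[level]
```

Keeping the policy in a table rather than in branching code makes it straightforward to adjust the handling for each level, which the disclosure explicitly leaves to the implementer.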
[0175] Figure 2 is a schematic diagram of the module architecture of an anomaly detection system corresponding to the anomaly detection method provided in this embodiment of the disclosure. Referring to Figure 2, the anomaly detection system includes a CI cluster status management module 21, a CI data real-time acquisition module 22, a CI control chart calculation module 23, a CI anomaly detection module 24, and a CI anomaly response module 25.
[0176] In the example, the CI cluster status management module 21 is used to monitor the resource status of the continuous integration system, i.e., the CI cluster's cluster resources. The CI cluster status management module 21 obtains information such as the cluster resource utilization rate of various cluster resources in the continuous integration system and the submission requests of CI tasks through various components and / or tools; based on this information, it can determine the cluster availability status, CI task queuing status, and other information.
[0177] The CI cluster status management module 21 uses components and/or tools such as Kubernetes (an open-source container orchestration platform), containerd (a container runtime that manages the container lifecycle on a single node), GPU Operator (a component for configuring and managing GPU resources in a cluster), and Jenkins (an automation tool for CI/CD, i.e., continuous integration and continuous delivery).
[0178] In the example, the CI data real-time acquisition module 22 uses three task dimension tags—code repository identifier, code version branch identifier, and hardware resource type identifier—and various components and / or tools to collect data from continuous integration tasks in real time. This yields initial task information such as task build results (success / failure), task build duration, task test results (success / failure), and task queuing time. The components and / or tools used by the CI data real-time acquisition module 22 include the CI / CD tool Jenkins, the time-series database InfluxDB (a high-performance time-series database), and a CI log parser (a tool / component specifically for processing CI process logs).
[0179] In the example, after collecting the initial task information of the continuous integration task, the CI data real-time acquisition module 22 can standardize the initial task information, including task completion time, task dimension labels, task metric values, etc.; and perform data cleaning on the continuous integration task to obtain a valid dataset of continuous integration tasks. The dataset includes the task identifier ID and task information for each continuous integration task. The task information includes at least one of the following: task request time, task dimension label, task build result, task build duration, task test result, and task queuing time.
[0180] Figure 3 is a schematic diagram illustrating the processing flow of an anomaly detection method provided in an embodiment of this disclosure. Referring to Figure 3, in S301, the CI data real-time acquisition module 22 obtains the task information of the continuous integration tasks; in S302, the CI control chart calculation module 23 divides the continuous integration tasks into multiple task subgroups according to their task completion times.
[0181] In the example, the CI control chart calculation module 23 matches control charts according to the type of each target task indicator in S303; that is, discrete task indicators are matched with p-control charts, and continuous second task indicators are matched with XR control charts. In S304, discrete indicators are calculated using p-control charts; in S305, continuous indicators are calculated using XR control charts. Furthermore, in S306, the CI control chart statistics (mean / range / σ) for each target task indicator are calculated according to the task subgroups, and the control chart baseline values (centerline / control limits) are determined, including adaptive adjustments to the control chart baseline values. The components and tools used by the CI control chart calculation module 23 include a custom calculation engine (e.g., implemented in Python) and the time-series database InfluxDB.
[0182] In the example, the CI anomaly detection module 24 applies the Shewhart anomaly criteria, i.e., the multiple anomaly detection conditions mentioned above, to detect whether the target task indicators are abnormal, achieving multiple anomaly detection in S306. It determines the CI anomaly flags (Anomaly1~Anomaly4) for the multiple anomaly detection conditions, calculates the overall anomaly score, obtains the anomaly detection result in S307, and determines the anomaly level in S308 (levels 1-3, corresponding to mild, moderate, and severe anomalies), generating a CI anomaly report. Furthermore, the CI anomaly detection module 24 also dynamically updates the anomaly detection conditions in S306.
[0183] In the example, the CI anomaly response module 25 utilizes corresponding components and / or tools to perform anomaly response processing (including retry, pause, rollback, etc.) at the anomaly level in S309. The CI anomaly response module 25 can update the CI monitoring dashboard in real time, outputting CI response commands (retry build / pause commit), alarm notifications, and information such as CI monitoring dashboard data (build success rate / average build time, etc.). The components and / or tools utilized by the CI anomaly response module 25 include CI / CD tools such as Jenkins, gateways, and monitoring dashboards such as Grafana.
[0184] The following is an application example of the anomaly detection method provided in this embodiment. In the example, the typical scenario is the construction task "AI operator library-master branch-GPU_TYPE1". The process follows the entire chain of "initialization--collection--calculation--detection--response", focusing on real-time performance and construction assurance.
[0185] Step 1: Initialize CI control parameters.
[0186] Core parameter configuration: Discrete indicator (construction success rate) subgroup size n=100, continuous indicator (construction time) subgroup size n=5, forgetting factor λ=0.9.
[0187] Historical baseline calculation: Pull the full build data of the AI operator library master branch on GPU_TYPE1 for the past 30 days, and generate the initial control limits through statistical modeling.
[0188] Success rate: Calculate the control limits using the p-control chart formula.
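[0189] Build failure rate center line: $CL = \bar{p} = 2\%$.
[0190] Build failure rate standard deviation: $\sigma = \sqrt{\bar{p}(1-\bar{p})/n} = \sqrt{0.02 \times 0.98 / 100} = 1.4\%$.
[0191] Build failure rate upper control limit: $UCL = 2\% + 3 \times 1.4\% = 6.2\%$.
[0192] Build failure rate lower control limit: $LCL = \max(0,\ 2\% - 3 \times 1.4\%) = 0\%$.
[0193] For the control chart baseline values of the task build success rate in the task indicators: center line CL = 100% - 2% = 98%; UCL = 100% - 0% = 100%; LCL = 100% - 6.2% = 93.8%.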
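The p-control chart arithmetic above can be checked with a short Python sketch. It assumes the standard p-chart standard deviation $\sqrt{\bar{p}(1-\bar{p})/n}$, which matches the 1.4% figure in the example; the function names are hypothetical.

```python
import math


def p_chart_limits(p_bar: float, n: int) -> dict[str, float]:
    """3-sigma limits of a p-chart for a failure proportion p_bar and subgroup size n."""
    sigma = math.sqrt(p_bar * (1.0 - p_bar) / n)
    return {
        "cl": p_bar,
        "sigma": sigma,
        "ucl": min(1.0, p_bar + 3.0 * sigma),  # a proportion cannot exceed 1
        "lcl": max(0.0, p_bar - 3.0 * sigma),  # a proportion cannot be negative
    }


def to_success_rate(failure: dict[str, float]) -> dict[str, float]:
    """Mirror failure-rate limits into success-rate limits (success = 1 - failure)."""
    return {
        "cl": 1.0 - failure["cl"],
        "ucl": 1.0 - failure["lcl"],
        "lcl": 1.0 - failure["ucl"],
    }


baseline = p_chart_limits(0.02, 100)
success = to_success_rate(baseline)
print(round(baseline["sigma"], 4), round(baseline["ucl"], 4), baseline["lcl"])  # 0.014 0.062 0.0
print(round(success["cl"], 3), success["ucl"], round(success["lcl"], 3))        # 0.98 1.0 0.938
```

Note that the success-rate limits are obtained by swapping and mirroring the failure-rate limits, which is why the success-rate UCL is 100% when the failure-rate LCL is clamped to 0%.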
[0194] For the task build time in the task indicators: calculate the control limits using the XR control chart formula. The average range of all subgroups is $\bar{R} = 4.65$ min and the mean of all subgroups is $\bar{\bar{X}} \approx 45$ min, so $\sigma = \bar{R}/d_2 \approx 4.65/2.326 \approx 2$ min, where $d_2 = 2.326$ is the range coefficient corresponding to subgroup size n=5. The control chart baseline values are: $CL = \bar{\bar{X}} = 45$ min; build time $UCL = CL + A_2 \times \bar{R} = 45 + 0.577 \times 4.65 \approx 47.7$ min; build time $LCL = CL - A_2 \times \bar{R} = 45 - 0.577 \times 4.65 \approx 42.3$ min.
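The build-time limits can be reproduced in the same way. The sketch below hard-codes the standard Shewhart constants for subgroup size n = 5 ($A_2 = 0.577$, $d_2 = 2.326$); the function names are hypothetical, and only the mean chart is shown because the example gives no range-chart limits.

```python
# Standard Shewhart control chart constants for subgroup size n = 5.
A2_N5 = 0.577
D2_N5 = 2.326


def xbar_limits(grand_mean: float, mean_range: float, a2: float = A2_N5) -> dict[str, float]:
    """Center line and control limits of the mean chart from the average subgroup range."""
    half_width = a2 * mean_range
    return {
        "cl": grand_mean,
        "ucl": grand_mean + half_width,
        "lcl": grand_mean - half_width,
    }


def sigma_from_range(mean_range: float, d2: float = D2_N5) -> float:
    """Estimate the process standard deviation from the average subgroup range."""
    return mean_range / d2


limits = xbar_limits(grand_mean=45.0, mean_range=4.65)
print(round(limits["ucl"], 1), round(limits["lcl"], 1))  # 47.7 42.3
print(round(sigma_from_range(4.65), 2))                  # 2.0
```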
[0195] Output: Initial control parameter configuration file and baseline control limit dataset, which serve as the calculation baseline for subsequent processes.
[0196] Step 2: Real-time acquisition of CI data.
[0197] Data collection dimensions: Tags are applied based on three dimensions: "repository = AI operator library, branch = master, GPU model = GPU_TYPE1" to ensure data traceability.
[0198] Collection frequency: Build time / build result is collected once every 1 minute (can be adjusted according to the average number of code commits per day), GPU node load (video memory / memory / CPU usage) is collected once every 10 seconds, and the build task ID is synchronously associated.
[0199] Data cleaning: A dual cleaning process of "3σ criterion + business rules" is adopted to remove extreme values, filter non-system abnormal data such as "manual termination of build", and output a standardized CI dataset (format: timestamp | three-dimensional label | build time | build result | GPU node ID, example: 202X-10-24 09:00:00 | AI operator library-master-GPU_TYPE1 | 52min | success | node-08).
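A minimal sketch of the two-stage cleaning, assuming each record carries a flag for manual termination (the business rule named above) and a build duration in minutes. The `BuildRecord` type and its field names are hypothetical.

```python
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildRecord:
    task_id: str
    build_minutes: float
    result: str
    terminated_manually: bool = False


def clean_records(records: list[BuildRecord]) -> list[BuildRecord]:
    """Apply the business rule first, then drop build times outside mean +/- 3 sigma."""
    # Business rule: manually terminated builds are not system anomalies.
    kept = [r for r in records if not r.terminated_manually]
    if len(kept) < 3:
        return kept  # too few points to estimate a spread
    durations = [r.build_minutes for r in kept]
    mean = statistics.fmean(durations)
    sd = statistics.stdev(durations)
    if sd == 0.0:
        return kept
    return [r for r in kept if abs(r.build_minutes - mean) <= 3.0 * sd]
```

With the sample standard deviation, a single extreme value can only fall outside 3σ when the batch holds at least 11 records, so this filter is meaningful on batches of that size or larger.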
[0200] Step 3: CI control chart calculation.
[0201] Discrete indicator calculation (task build success rate): Based on the 100 build records collected in step 2 (96 successful, 4 failed), calculate the p-control chart parameters: subgroup build failure rate $CL = \hat{p} = 4/100 = 4\%$; subgroup build failure rate $\sigma = \sqrt{0.04 \times 0.96 / 100} \approx 1.96\%$; subgroup build failure rate $UCL = 4\% + 3 \times 1.96\% \approx 9.88\%$; subgroup build failure rate $LCL = \max(0,\ 4\% - 3 \times 1.96\%) = 0\%$.
[0202] Then we have: Subgroup construction success rate CL=96%; Subgroup construction success rate UCL=100%; Subgroup construction success rate LCL=100%-9.88%=90.12%.
[0203] Centerline update: Execute the CI baseline adaptive adjustment algorithm (real-time fine-tuning): substitute the failure rate centerline $CL_{t-1} = 2\%$ from the previous detection period, the current subgroup mean $\bar{p}_t = 4\%$, and $\lambda = 0.9$ into Formula 1.
[0204] Dynamic build failure rate centerline: $CL_t = \lambda\, CL_{t-1} + (1-\lambda)\, \bar{p}_t = 0.9 \times 2\% + 0.1 \times 4\% = 2.2\%$.
[0205] Dynamic build failure rate standard deviation: $\sigma_t \approx 1.44\%$.
[0206] Dynamic build failure rate upper control limit: $UCL_t = CL_t + 3\sigma_t = 2.2\% + 3 \times 1.44\% \approx 6.52\%$.
[0207] Dynamic build failure rate lower control limit: $LCL_t = \max(0,\ CL_t - 3\sigma_t) = \max(0,\ 2.2\% - 3 \times 1.44\%) = 0\%$.
[0208] Therefore: Dynamic build success rate CL = 100% - 2.2% = 97.8%; Dynamic build success rate UCL = 100%; Dynamic build success rate LCL = 100% - 6.52% = 93.48%.
[0209] Continuous indicator calculation (average build time): Based on the 5 build time values collected in step 2 (48 min, 51 min, 53 min, 49 min, 54 min), calculate the XR control chart parameters: subgroup mean $\bar{X}_t = (48+51+53+49+54)/5 = 51$ min; subgroup range $R = 54 - 48 = 6$ min.
[0210] Centerline update: Execute the CI baseline adaptive adjustment algorithm (real-time fine-tuning): substitute the build time centerline $CL_{t-1} = 45$ min from the previous detection period, the current subgroup mean $\bar{X}_t = 51$ min, and $\lambda = 0.9$ into Formula 1.
[0211] Dynamic build time centerline: $CL_t = \lambda\, CL_{t-1} + (1-\lambda)\, \bar{X}_t = 0.9 \times 45 + 0.1 \times 51 = 45.6$ min.
[0212] Dynamic build time standard deviation: $\sigma_t = 2.55$ min.
[0213] Dynamic build time upper control limit: $UCL_t = CL_t + 3\sigma_t = 45.6 + 3 \times 2.55 = 53.25$ min.
[0214] Dynamic build time lower control limit: $LCL_t = \max(0,\ CL_t - 3\sigma_t) = \max(0,\ 45.6 - 3 \times 2.55) = 37.95$ min.
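The adaptive centerline update in Step 3 is an exponentially weighted blend of the previous centerline and the current subgroup mean. The Python sketch below reproduces the numbers for both indicators; the function names are hypothetical, and the dynamic standard deviations (1.44% and 2.55 min) are taken as given from the example rather than recomputed.

```python
def update_centerline(prev_cl: float, subgroup_mean: float, lam: float = 0.9) -> float:
    """Blend the previous centerline with the current subgroup mean using forgetting factor lam."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("forgetting factor must be in [0, 1]")
    return lam * prev_cl + (1.0 - lam) * subgroup_mean


def three_sigma_limits(cl: float, sigma: float) -> tuple[float, float]:
    """Return (LCL, UCL); the lower limit is clamped at zero as in the example."""
    return max(0.0, cl - 3.0 * sigma), cl + 3.0 * sigma


# Build failure rate, in percent.
fail_cl = update_centerline(2.0, 4.0)
fail_lcl, fail_ucl = three_sigma_limits(fail_cl, 1.44)
print(round(fail_cl, 2), round(fail_ucl, 2), fail_lcl)  # 2.2 6.52 0.0

# Build time, in minutes.
time_cl = update_centerline(45.0, 51.0)
time_lcl, time_ucl = three_sigma_limits(time_cl, 2.55)
print(round(time_cl, 2), round(time_ucl, 2), round(time_lcl, 2))  # 45.6 53.25 37.95
```

A forgetting factor close to 1 keeps the baseline stable and lets it drift only slowly toward new subgroup means; a smaller value adapts faster but makes sustained shifts harder to detect.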
[0215] Output: Complete parameters (CL / UCL / LCL) of the p-control chart and the XR control chart and the real-time statistics dataset, which are pushed to the CI anomaly detection module.
[0216] Step 4: CI anomaly detection.
[0217] Task build success rate: The success rate of the current subgroup is 96%, within the control limit range (LCL = 93.48% < 96% < UCL = 100%), so criterion 1 (point out of bounds) is not triggered: Anomaly1 = 0. The success rates of the most recent 7 subgroups are all lower than their respective dynamic CLs (the dynamic build success rate CL is calculated for each subgroup, e.g., the CL for the most recent subgroup is 97.8%), triggering criterion 2 (7 points on the same side): Anomaly2 = 1.
[0218] Average build time: The mean of the current subgroup is 51 min, within the control limit range (LCL = 37.95 min < 51 min < UCL = 53.25 min), so criterion 1 (point out of bounds) is not triggered: Anomaly1 = 0. The means of the most recent 7 subgroups show a continuous upward trend (44 min → 46.6 min → 47.5 min → 48 min → 49.3 min → 50.8 min → 51 min), triggering criterion 3 (7 points continuously rising): Anomaly3 = 1. Among the means of the most recent 3 subgroups (49.3 min, 50.8 min, 51 min), 2 points exceed the dynamic CL + 2σ of the corresponding subgroups (for the second subgroup, CL + 2σ = 50.2 min; for the third subgroup, CL = 45.6 min and σ = 2.55 min, so CL + 2σ = 50.7 min), triggering criterion 4 (2 of 3 points beyond 2σ): Anomaly4 = 1.
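The four criteria used in this step can be sketched as small Python predicates. The window sizes (7, 7, and 2 of 3) follow the example, and the function names are hypothetical. In the rule 4 usage only the third subgroup's CL and σ and the second subgroup's CL + 2σ = 50.2 come from the example; the remaining per-subgroup values are assumed for illustration.

```python
from typing import Sequence


def rule1_out_of_limits(value: float, lcl: float, ucl: float) -> bool:
    """Criterion 1: the latest point lies outside the control limits."""
    return value < lcl or value > ucl


def rule2_same_side(values: Sequence[float], centerlines: Sequence[float], m1: int = 7) -> bool:
    """Criterion 2: the last m1 points are all on the same side of their centerlines."""
    if len(values) < m1:
        return False
    pairs = list(zip(values[-m1:], centerlines[-m1:]))
    return all(v > c for v, c in pairs) or all(v < c for v, c in pairs)


def rule3_monotonic(values: Sequence[float], m2: int = 7) -> bool:
    """Criterion 3: the last m2 points are strictly rising or strictly falling."""
    if len(values) < m2:
        return False
    window = values[-m2:]
    diffs = [b - a for a, b in zip(window, window[1:])]
    return all(d > 0 for d in diffs) or all(d < 0 for d in diffs)


def rule4_beyond_2sigma(values: Sequence[float], centerlines: Sequence[float],
                        sigmas: Sequence[float], m3: int = 3, m4: int = 2) -> bool:
    """Criterion 4: at least m4 of the last m3 points are beyond 2 sigma on the same side."""
    if len(values) < m3:
        return False
    window = list(zip(values[-m3:], centerlines[-m3:], sigmas[-m3:]))
    above = sum(v > c + 2.0 * s for v, c, s in window)
    below = sum(v < c - 2.0 * s for v, c, s in window)
    return above >= m4 or below >= m4


means = [44.0, 46.6, 47.5, 48.0, 49.3, 50.8, 51.0]
print(rule1_out_of_limits(means[-1], lcl=37.95, ucl=53.25))  # False
print(rule3_monotonic(means))                                # True
# Assumed centerlines for the first two of the last three subgroups (see lead-in).
print(rule4_beyond_2sigma(means, [45.0, 45.1, 45.6], [2.55, 2.55, 2.55]))  # True
```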
[0219] Total anomaly score S = construction success rate score + construction time score = 1.5 +1 =8.
[0220] Anomaly level determination: The total anomaly score S = 8 triggers a level 2 (moderate) anomaly.
[0221] Anomaly Report Archiving: Records the anomaly period (09:00-09:30), associated anomaly build task ID (anomaly build task in build-10000~build-10004 corresponding to the most recent anomaly subgroup, such as build-10002, build-10004), and associated GPU node (such as node-07, node-08) corresponding to the anomaly build task.
[0222] Step 5: CI anomaly response.
[0223] Task processing: Automatically trigger the retry mechanism for abnormal build task IDs (build-10002, build-10004) (up to 3 times); through the CI task controller, prioritize scheduling retry tasks to idle GPU_TYPE1 nodes (node-10) and isolate the original high-load nodes (node-07, node-08) to avoid resource interference.
[0224] Notification push: Layered notifications are pushed via gateway and SMS. R&D team (code committer): "build-10002 and build-10004 triggered a level 2 anomaly: the build success rate of the most recent 7 points is below CL; the build time of the most recent 7 points is rising, and 2 of 3 points exceed 2σ; the tasks have been retried on node-10; recent kernel code commit records need to be checked within 2 hours". CI administrator: "Two build tasks associated with nodes node-07 and node-08 have build time anomalies; the VRAM / RAM / CPU load curves of the nodes from 08:00 to 09:30 need to be exported within 2 hours to investigate long-term load fluctuations".
[0225] The anomaly detection method according to the embodiments of this disclosure can combine CI quality inspection with Shewhart quality control charts in the field of statistical quality control. It divides continuous integration tasks within the inspection period into task subgroups based on task completion time and task indicators, determines the indicator values of the task subgroups based on task information, performs anomaly detection on the indicator values of the task subgroups based on the control chart baseline values and anomaly detection conditions within the inspection period, and obtains anomaly detection results. Furthermore, it executes anomaly response processing corresponding to the anomaly level when anomalies exist, thereby improving the accuracy of anomaly identification.
[0226] The anomaly detection method according to embodiments of this disclosure can establish a dedicated CI quality indicator system. Discrete CI indicators use p-control charts, and continuous CI indicators use XR control charts, improving the accuracy of anomaly detection and the precision of anomaly identification. It can achieve adaptive CI benchmarks, dynamically adapting to CI indicator shifts caused by code iteration, maintaining monitoring effectiveness, and eliminating the need for frequent manual threshold adjustments. It combines CI anomaly detection conditions and CI anomaly response mechanisms to achieve CI anomaly responses at multiple anomaly levels, shortening the CI problem handling cycle. It can achieve automatic retry of builds under level 2 anomalies, avoiding manual intervention and significantly reducing build failure processing time. It can achieve paused branch commits under level 3 anomalies, preventing failed builds from accumulating and occupying system resources, and improving the effective utilization of system resources.
[0227] It is understood that the various method embodiments mentioned above in this disclosure can be combined with each other to form combined embodiments without violating the principle and logic. Due to space limitations, this disclosure will not elaborate further. Those skilled in the art will understand that in the above methods of specific implementation, the specific execution order of each step should be determined by its function and possible internal logic.
[0228] In addition, this disclosure also provides an anomaly detection device, electronic device, and computer-readable storage medium, all of which can be used to implement any of the anomaly detection methods provided in this disclosure. The corresponding technical solutions and descriptions are described in the corresponding section of the method and will not be repeated here.
[0229] Figure 4 is a block diagram of an anomaly detection device provided in an embodiment of this disclosure.
[0230] Referring to Figure 4, this disclosure provides an anomaly detection device, which includes the following modules 41-44.
[0231] The task grouping module 41 is used to divide continuous integration tasks whose task completion time is within the first detection period into at least one task subgroup.
[0232] The indicator value determination module 42 is used to determine the indicator value of the target task indicator for any task subgroup in the at least one task subgroup, based on the task information of the continuous integration task in the task subgroup.
[0233] The anomaly detection module 43 is used to perform anomaly detection on the indicator values of the task subgroup based on the control chart baseline value of the target task indicator in the first detection period and the preset anomaly detection conditions, and obtain the anomaly detection result of the first detection period; the anomaly detection result is used to indicate whether there is an anomaly and the anomaly level when there is an anomaly.
[0234] An anomaly response module 44 is used to perform an anomaly response process corresponding to the anomaly level of the anomaly detection result when the anomaly detection result indicates that an anomaly exists.
[0235] In some possible implementations, the target task indicator includes a discrete first task indicator, the first control chart of the first task indicator is a p-control chart, and the first task indicator corresponds to a first task subgroup; the device further includes: a first reference value determination module, configured to determine a first mean of the first task indicator based on the first indicator value of the first task subgroup for the first task indicator within the first detection period; and to determine a control chart reference value of the first control chart based on the first mean, wherein the control chart reference value of the first control chart includes at least one of a first centerline, a first upper control limit, and a first lower control limit.
[0236] In some possible implementations, the target task indicator includes a continuous second task indicator, the second control chart of the second task indicator is an XR control chart, and the second task indicator corresponds to a second task subgroup; the device further includes: a second reference value determination module, used to determine a second mean of the subgroup range of the second task subgroup and a third mean of the second task indicator based on the second indicator value of the second task subgroup for the second task indicator within the first detection period; and to determine a control chart reference value of the second control chart based on the second mean and the third mean, wherein the control chart reference value of the second control chart includes at least one of the second center line, the second upper control limit, and the second lower control limit of the range chart, and at least one of the third center line, the third upper control limit, and the third lower control limit of the mean chart.
[0237] In some possible implementations, the apparatus further includes: a baseline update module, configured to determine an initial control chart baseline value for the first detection period based on the indicator values of the at least one task subgroup for the target task indicator; and to determine a control chart baseline value for the first detection period based on the control chart baseline value for the second detection period, the initial control chart baseline value for the first detection period, and a preset forgetting factor, wherein the second detection period is the previous detection period of the first detection period.
[0238] In some possible implementations, the anomaly detection conditions include at least one of a first anomaly condition, a second anomaly condition, a third anomaly condition, and a fourth anomaly condition, wherein the first anomaly condition indicates that the indicator value of a task subgroup exceeds the control limit range of the control chart baseline value of the target task indicator; the second anomaly condition indicates that the indicator values of M1 consecutive task subgroups are on the same side of the center line of the control chart baseline value of the target task indicator; the third anomaly condition indicates that the indicator values of M2 consecutive task subgroups are monotonically increasing or monotonically decreasing; the fourth anomaly condition indicates that M4 of the indicator values of M3 consecutive task subgroups exceed twice the standard deviation range on the same side of the center line, where M1, M2, M3, and M4 are integers greater than 1, and M4 < M3.
[0239] In some possible implementations, the anomaly detection result includes the anomaly level. The anomaly detection module 43 is configured to: determine the anomaly detection conditions triggered by the task subgroups within the first detection period for the target task indicator; determine the anomaly score of the target task indicator based on the anomaly detection conditions triggered by the task subgroups and the risk weight of each anomaly detection condition; if there are multiple target task indicators, determine the overall anomaly score of the first detection period based on the indicator weight and anomaly score of each of the multiple target task indicators; and determine the anomaly level of the first detection period based on the overall anomaly score, wherein the anomaly detection result includes the anomaly level.
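The scoring chain described above (risk weight per triggered condition, indicator weight per target task indicator, score-to-level mapping) can be sketched as follows. All numeric values in this sketch, i.e. the risk weights, indicator weights, and level bounds, are hypothetical placeholders chosen for illustration; they are not the values used in the disclosure and do not reproduce the score in the application example.

```python
# Hypothetical risk weight for each anomaly detection condition (1 to 4).
RISK_WEIGHTS = {1: 3.0, 2: 1.5, 3: 1.0, 4: 2.0}

# Hypothetical (level, exclusive lower bound) pairs, checked from the highest level down.
LEVEL_BOUNDS = ((3, 6.0), (2, 3.0), (1, 0.0))


def indicator_score(triggered_conditions: set[int]) -> float:
    """Sum the risk weights of the conditions triggered for one target task indicator."""
    return sum(RISK_WEIGHTS[c] for c in triggered_conditions)


def overall_score(scores: dict[str, float], indicator_weights: dict[str, float]) -> float:
    """Weighted sum of per-indicator anomaly scores for one detection period."""
    return sum(indicator_weights[name] * score for name, score in scores.items())


def anomaly_level(score: float) -> int:
    """Map an overall score to a level; 0 means no anomaly."""
    for level, lower_bound in LEVEL_BOUNDS:
        if score > lower_bound:
            return level
    return 0


scores = {
    "build_success_rate": indicator_score({2}),
    "build_time": indicator_score({3, 4}),
}
total = overall_score(scores, {"build_success_rate": 0.6, "build_time": 0.4})
print(anomaly_level(total))
```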
[0240] In some possible implementations, the task grouping module 41 is used to divide continuous integration tasks with the same task dimension labels and whose task completion time is within the first detection period into at least one task subgroup based on the task information of the continuous integration tasks in the continuous integration system and the target task indicators to be analyzed.
[0241] The task information includes at least one of the following: task request time, task dimension label, task completion time, task construction result, task construction duration, task test result, and task queuing duration.
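One way to picture the grouping step is the Python sketch below: tasks are filtered to the detection period, grouped by identical task dimension labels, ordered by completion time, and cut into fixed-size subgroups. The `Task` type and its field names are hypothetical, and dropping a trailing partial subgroup is an assumption made here, not something the disclosure specifies.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Task:
    task_id: str
    labels: tuple[str, str, str]  # (code repository, code version branch, hardware resource type)
    completed_at: datetime
    build_ok: bool
    build_minutes: float


def split_into_subgroups(tasks: list[Task], period_start: datetime, period_end: datetime,
                         size: int) -> dict[tuple[str, str, str], list[list[Task]]]:
    """Group tasks completed in [period_start, period_end) by label, then cut into subgroups."""
    by_label: dict[tuple[str, str, str], list[Task]] = defaultdict(list)
    for task in tasks:
        if period_start <= task.completed_at < period_end:
            by_label[task.labels].append(task)

    subgroups: dict[tuple[str, str, str], list[list[Task]]] = {}
    for labels, group in by_label.items():
        group.sort(key=lambda t: t.completed_at)
        full = len(group) // size  # assumption: a trailing partial subgroup is dropped
        subgroups[labels] = [group[i * size:(i + 1) * size] for i in range(full)]
    return subgroups
```

Because discrete and continuous indicators use different subgroup sizes (for example n=100 and n=5 in the application example), the function would be called once per indicator type with the corresponding `size`.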
[0242] In some possible implementations, the task dimension label includes at least one of the code repository identifier, code version branch identifier, and hardware resource type identifier of the continuous integration task; the anomaly level includes at least two of a first level, a second level, and a third level, wherein the third level is higher than the second level, the second level is higher than the first level, and the third level is higher than the first level. Specifically, the anomaly response handling corresponding to the first level includes: marking the anomaly task subgroup and the task dimension label of the anomaly tasks within the anomaly task subgroup, and sending a first alarm notification to the first management device; the anomaly response handling corresponding to the second level includes: retrying to build the anomaly tasks within the anomaly task subgroup, updating the display page of the target task metrics, and sending a second alarm notification to the first device of the person who submitted the anomaly task and the first management device; the anomaly response handling corresponding to the third level includes at least one of the following: stopping the task submission of the code version branch corresponding to the anomaly task within the anomaly task subgroup, restarting the build node of the anomaly task in the continuous integration system, rolling back the code version branch code version to a stable version, and sending a third alarm notification to the second device of the person above the person who submitted the anomaly task and the management device above the first management device.
[0243] In some possible implementations, the continuous integration task includes code compilation task and functional testing task, and the target task metric includes a discrete first task metric and a continuous second task metric, wherein the number of tasks in the task subgroups of the first task metric and the second task metric are different; the first task metric includes at least one of task build success rate, task build retry success rate, and task test pass rate, and the second task metric includes at least one of average build time and average queuing time.
[0244] Figure 5 is a block diagram of an electronic device provided in an embodiment of the present disclosure.
[0245] Referring to Figure 5, this disclosure provides an electronic device, which includes: at least one processor 701; at least one memory 702; and one or more I/O interfaces 703 connected between the processor 701 and the memory 702; wherein the memory 702 stores one or more computer programs that can be executed by the at least one processor 701, and the one or more computer programs are executed by the at least one processor 701 to enable the at least one processor 701 to perform the above-described anomaly detection method.
[0246] This disclosure also provides a computer-readable storage medium storing a computer program thereon, wherein the computer program, when executed by a processor, implements the above-described anomaly detection method. The computer-readable storage medium may be volatile or non-volatile.
[0247] This disclosure also provides a computer program product, including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code, wherein when the computer-readable code is run in the processor of an electronic device, the processor in the electronic device executes the above-described anomaly detection method.
[0248] Those skilled in the art will understand that all or some of the steps, systems, and apparatuses disclosed above, and their functional modules / units, can be implemented as software, firmware, hardware, or suitable combinations thereof. In hardware implementations, the division between functional modules / units mentioned above does not necessarily correspond to the division of physical components; for example, a physical component may have multiple functions, or a function or step may be performed collaboratively by several physical components. Some or all physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit (ASIC). Such software can be distributed on a computer-readable storage medium, which may include computer storage media (or non-transitory media) and communication media (or transient media).
[0249] As is known to those skilled in the art, the term computer storage medium includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information, such as computer-readable program instructions, data structures, program modules, or other data. Computer storage media includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), static random access memory (SRAM), flash memory or other memory technologies, portable compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical disc storage, magnetic cartridges, magnetic tape, disk storage or other magnetic storage devices, or any other medium that can be used to store desired information and is accessible to a computer. Furthermore, it is known to those skilled in the art that communication media typically contain computer-readable program instructions, data structures, program modules, or other data in modulated data signals such as carrier waves or other transmission mechanisms, and may include any information delivery medium.
[0250] The computer-readable program instructions described herein can be downloaded from computer-readable storage media to various computing / processing devices, or downloaded via a network, such as the Internet, local area network, wide area network, and / or wireless network, to an external computer or external storage device. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and / or edge servers. A network adapter card or network interface in each computing / processing device receives the computer-readable program instructions from the network and forwards them to the computer-readable storage media in the respective computing / processing device.
[0251] The computer program instructions used to perform the operations of this disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., via the Internet using an Internet service provider). In some embodiments, electronic circuits, such as programmable logic circuits, field programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), are personalized by utilizing state information from the computer-readable program instructions. These electronic circuits can execute the computer-readable program instructions to implement various aspects of this disclosure.
[0252] The computer program product described herein can be implemented specifically through hardware, software, or a combination thereof. In one alternative embodiment, the computer program product is specifically embodied in a computer storage medium; in another alternative embodiment, the computer program product is specifically embodied in a software product, such as a software development kit (SDK), etc.
[0253] Various aspects of this disclosure are described herein with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this disclosure. It should be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer-readable program instructions.
[0254] These computer-readable program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine such that, when executed by the processor of the computer or other programmable data processing apparatus, they create means for implementing the functions / actions specified in one or more blocks of the flowchart and / or block diagram. These computer-readable program instructions can also be stored in a computer-readable storage medium that causes a computer, programmable data processing apparatus, and / or other device to operate in a particular manner; thus, the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions for implementing aspects of the functions / actions specified in one or more blocks of the flowchart and / or block diagram.
[0255] Computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus, or other device implement the functions / actions specified in one or more blocks of the flowchart and / or block diagram.
[0256] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of instructions containing one or more executable instructions for implementing a specified logical function. In some alternative implementations, the functions noted in the blocks may occur in a different order than that shown in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, or they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, may be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.
[0257] Example embodiments have been disclosed herein, and although specific terminology has been used, it is used and should be interpreted in a generic and descriptive sense only, and not for purposes of limitation. In some instances, it will be apparent to those skilled in the art that features, characteristics, and / or elements described in connection with a particular embodiment may be used alone, or in combination with features, characteristics, and / or elements described in connection with other embodiments, unless otherwise expressly indicated. Therefore, those skilled in the art will understand that various changes in form and detail may be made without departing from the scope of this disclosure as set forth in the appended claims.
Claims
1. An anomaly detection method, characterized by comprising: dividing continuous integration tasks whose completion time falls within a first detection period into at least one task subgroup; for any task subgroup of the at least one task subgroup, determining an indicator value of the task subgroup for a target task indicator based on task information of the continuous integration tasks in the task subgroup; performing anomaly detection on the indicator values of the task subgroups based on a control chart baseline value of the target task indicator within the first detection period and preset anomaly detection conditions, to obtain an anomaly detection result for the first detection period, wherein the anomaly detection result indicates whether an anomaly exists and, when an anomaly exists, its anomaly level; and, when the anomaly detection result indicates that an anomaly exists, performing anomaly response processing corresponding to the anomaly level of the anomaly detection result.
2. The method according to claim 1, characterized in that the target task indicator includes a discrete first task indicator, a first control chart of the first task indicator is a p control chart, and the first task indicator corresponds to a first task subgroup; the method further comprises: determining a first mean of the first task indicator based on first indicator values of the first task subgroup for the first task indicator within the first detection period; and determining the control chart baseline value of the first control chart based on the first mean, wherein the control chart baseline value of the first control chart includes at least one of a first center line, a first upper control limit, and a first lower control limit.
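As an illustration of the baseline computation in claim 2, the following is a minimal sketch only. It assumes the conventional Shewhart p-chart formulas (center line at the mean proportion, limits three standard errors away, clipped to the interval [0, 1]) and a constant subgroup size; the claim itself does not fix these formulas, and the function name `p_chart_limits` is hypothetical.

```python
import math


def p_chart_limits(proportions, subgroup_size):
    """Return center line and control limits for a p control chart.

    proportions   -- per-subgroup indicator values in [0, 1],
                     e.g. build success rate of each task subgroup
    subgroup_size -- number of CI tasks in every subgroup (assumed constant)
    """
    if not proportions or subgroup_size <= 0:
        raise ValueError("need at least one subgroup and a positive size")

    # "First mean": average of the subgroup proportions in the period.
    p_bar = sum(proportions) / len(proportions)

    # Standard error of a proportion under the binomial model.
    sigma = math.sqrt(p_bar * (1.0 - p_bar) / subgroup_size)

    return {
        "cl": p_bar,
        "ucl": min(1.0, p_bar + 3.0 * sigma),  # a rate cannot exceed 1
        "lcl": max(0.0, p_bar - 3.0 * sigma),  # a rate cannot be negative
        "sigma": sigma,
    }
```

For a success-rate indicator the lower control limit is usually the one of interest, since a subgroup falling below it signals an unusual number of failed builds.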
3. The method according to claim 1, characterized in that the target task indicator includes a continuous second task indicator, a second control chart of the second task indicator is an X-bar R control chart (mean and range chart), and the second task indicator corresponds to a second task subgroup; the method further comprises: determining, based on second indicator values of the second task subgroup for the second task indicator within the first detection period, a second mean, which is the mean of the subgroup ranges of the second task subgroup, and a third mean, which is the mean of the second task indicator; and determining the control chart baseline value of the second control chart based on the second mean and the third mean, wherein the control chart baseline value of the second control chart includes at least one of a second center line, a second upper control limit, and a second lower control limit of the range chart, and at least one of a third center line, a third upper control limit, and a third lower control limit of the mean chart.
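The mean and range baselines of claim 3 can be sketched as follows. This is an illustrative sketch only, assuming the conventional X-bar R formulas with the commonly tabulated constants A2, D3, and D4 for subgroup sizes 2 to 5; the claim does not prescribe these constants, and the names `XBAR_R_CONSTANTS` and `xbar_r_limits` are hypothetical.

```python
# Commonly tabulated Shewhart constants per subgroup size n: (A2, D3, D4).
XBAR_R_CONSTANTS = {
    2: (1.880, 0.0, 3.267),
    3: (1.023, 0.0, 2.574),
    4: (0.729, 0.0, 2.282),
    5: (0.577, 0.0, 2.114),
}


def xbar_r_limits(subgroups):
    """Return baselines for the mean chart and the range chart.

    subgroups -- list of equally sized lists of measurements,
                 e.g. build durations of the tasks in each subgroup
    """
    if not subgroups:
        raise ValueError("need at least one subgroup")
    n = len(subgroups[0])
    if n not in XBAR_R_CONSTANTS or any(len(g) != n for g in subgroups):
        raise ValueError("subgroups must share one size between 2 and 5")

    means = [sum(g) / n for g in subgroups]
    ranges = [max(g) - min(g) for g in subgroups]

    grand_mean = sum(means) / len(means)   # "third mean"
    mean_range = sum(ranges) / len(ranges)  # "second mean"
    a2, d3, d4 = XBAR_R_CONSTANTS[n]

    return {
        "mean_chart": {
            "cl": grand_mean,
            "ucl": grand_mean + a2 * mean_range,
            "lcl": grand_mean - a2 * mean_range,
        },
        "range_chart": {
            "cl": mean_range,
            "ucl": d4 * mean_range,
            "lcl": d3 * mean_range,
        },
    }
```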
4. The method according to any one of claims 1-3, characterized in that the method further comprises: determining an initial control chart baseline value for the first detection period based on the indicator values of the at least one task subgroup for the target task indicator; and determining the control chart baseline value for the first detection period based on the control chart baseline value for a second detection period, the initial control chart baseline value for the first detection period, and a preset forgetting factor, wherein the second detection period is the detection period preceding the first detection period.
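One way to picture the update in claim 4 is an exponential blend of the previous baseline with the freshly computed one. The sketch below is an assumption, not the claimed formula: the claim names the three inputs but does not state how they are combined, so the convention that the forgetting factor weights the previous period is chosen here purely for illustration, and `blend_baseline` is a hypothetical name.

```python
def blend_baseline(previous, initial, forgetting_factor):
    """Blend the previous period's baseline with the current initial baseline.

    previous, initial -- dicts with the same keys, e.g. "cl", "ucl", "lcl"
    forgetting_factor -- value in [0, 1]; ASSUMPTION: it is the weight given
                         to the previous period, so smaller values adapt
                         faster to drift caused by code iteration
    """
    if not 0.0 <= forgetting_factor <= 1.0:
        raise ValueError("forgetting factor must lie in [0, 1]")
    if previous.keys() != initial.keys():
        raise ValueError("baselines must describe the same chart lines")

    keep = forgetting_factor
    return {
        name: keep * previous[name] + (1.0 - keep) * initial[name]
        for name in initial
    }
```

With a factor of 0.7, a center line that was 0.90 in the previous period and 0.80 in the current data becomes 0.87, so the baseline follows the shift without jumping to it in one step.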
5. The method according to any one of claims 1-4, characterized in that the anomaly detection conditions include at least one of a first anomaly condition, a second anomaly condition, a third anomaly condition, and a fourth anomaly condition; wherein the first anomaly condition indicates that the indicator value of a task subgroup falls outside the control limit range of the control chart baseline value of the target task indicator; the second anomaly condition indicates that the indicator values of M1 consecutive task subgroups lie on the same side of the center line of the control chart baseline value of the target task indicator; the third anomaly condition indicates that the indicator values of M2 consecutive task subgroups are monotonically increasing or monotonically decreasing; and the fourth anomaly condition indicates that, among M3 consecutive task subgroups, M4 indicator values lie more than two standard deviations from the center line on the same side of the center line, where M1, M2, M3, and M4 are integers greater than 1, and M4 < M3.
6. The method according to claim 5, characterized in that the anomaly detection result includes the anomaly level, and performing anomaly detection on the indicator values of the task subgroups to obtain the anomaly detection result for the first detection period comprises: for the target task indicator, determining the anomaly detection conditions triggered by the task subgroups within the first detection period; determining an anomaly score of the target task indicator based on the anomaly detection conditions triggered by the task subgroups and a risk weight of each anomaly detection condition; when there are multiple target task indicators, determining an overall anomaly score of the first detection period based on the indicator weight and the anomaly score of each of the multiple target task indicators; and determining the anomaly level of the first detection period based on the overall anomaly score.
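The four conditions of claim 5 and the weighted scoring of claim 6 can be sketched together. Everything numeric below is an assumption made for illustration: the defaults M1=8, M2=6, M3=3, M4=2 follow common run-rule conventions rather than the claims, the risk weights and level thresholds are invented placeholders, monotonic is read as strictly monotonic, and all function and key names are hypothetical.

```python
def _windows(values, size):
    """Yield every run of `size` consecutive subgroup indicator values."""
    for start in range(len(values) - size + 1):
        yield values[start:start + size]


def triggered_conditions(values, cl, ucl, lcl, sigma, m1=8, m2=6, m3=3, m4=2):
    """Return the set of anomaly conditions hit by the subgroup values."""
    hits = set()
    # Condition 1: any value outside the control limits.
    if any(v > ucl or v < lcl for v in values):
        hits.add("beyond_limits")
    # Condition 2: M1 consecutive values on one side of the center line.
    if any(all(v > cl for v in w) or all(v < cl for v in w)
           for w in _windows(values, m1)):
        hits.add("same_side_run")
    # Condition 3: M2 consecutive values strictly rising or strictly falling.
    if any(all(a < b for a, b in zip(w, w[1:]))
           or all(a > b for a, b in zip(w, w[1:]))
           for w in _windows(values, m2)):
        hits.add("trend")
    # Condition 4: M4 of M3 consecutive values beyond 2 sigma on one side.
    for w in _windows(values, m3):
        above = sum(v > cl + 2 * sigma for v in w)
        below = sum(v < cl - 2 * sigma for v in w)
        if above >= m4 or below >= m4:
            hits.add("two_sigma_zone")
            break
    return hits


# Placeholder risk weights per condition (assumed, not from the claims).
RISK_WEIGHTS = {"beyond_limits": 3, "same_side_run": 2,
                "trend": 2, "two_sigma_zone": 1}


def indicator_score(hits):
    """Anomaly score of one indicator: sum of the triggered risk weights."""
    return sum(RISK_WEIGHTS[h] for h in hits)


def overall_level(scores, indicator_weights, thresholds=(1.0, 2.5, 4.0)):
    """Combine per-indicator scores; return (overall score, level 0..3)."""
    total = sum(indicator_weights[name] * score
                for name, score in scores.items())
    level = sum(total >= t for t in thresholds)  # 0 means no anomaly
    return total, level
```

Passing `sigma` explicitly, instead of deriving it from the limits, keeps the fourth condition correct when a limit has been clipped, as can happen on a p chart.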
7. The method according to any one of claims 1-6, characterized in that dividing the continuous integration tasks whose completion time falls within the first detection period into at least one task subgroup comprises: based on the task information of the continuous integration tasks in a continuous integration system and the target task indicator to be analyzed, dividing continuous integration tasks that have the same task dimension label and whose task completion time falls within the first detection period into at least one task subgroup; wherein the task information includes at least one of the following: task request time, task dimension label, task completion time, task build result, task build duration, task test result, and task queuing duration.
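The grouping step of claim 7 can be sketched as a filter on completion time followed by bucketing on the dimension label. This is a sketch under stated assumptions: the `CiTask` record and its fields are hypothetical, the period is treated as half-open, subgroups are formed as fixed-size chunks in completion order, and an incomplete trailing chunk is dropped so that every subgroup has the same size. None of these choices is specified by the claim.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CiTask:
    label: tuple            # e.g. (repository, branch, hardware type)
    completed_at: datetime  # task completion time
    build_ok: bool          # task build result


def split_into_subgroups(tasks, period_start, period_end, subgroup_size):
    """Group tasks completed in [period_start, period_end) by label,
    then cut each group into subgroups of `subgroup_size` tasks."""
    if subgroup_size <= 0:
        raise ValueError("subgroup size must be positive")

    by_label = defaultdict(list)
    for task in tasks:
        if period_start <= task.completed_at < period_end:
            by_label[task.label].append(task)

    subgroups = {}
    for label, group in by_label.items():
        group.sort(key=lambda t: t.completed_at)
        # ASSUMPTION: drop the incomplete tail so subgroup sizes stay equal.
        usable = len(group) - len(group) % subgroup_size
        subgroups[label] = [group[i:i + subgroup_size]
                            for i in range(0, usable, subgroup_size)]
    return subgroups
```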
8. The method according to claim 7, characterized in that the task dimension label includes at least one of the following: a code repository identifier, a code version branch identifier, and a hardware resource type identifier of the continuous integration task; the anomaly level includes at least two of a first level, a second level, and a third level, wherein the third level is higher than the second level, the second level is higher than the first level, and the third level is higher than the first level; the anomaly response processing corresponding to the first level includes: marking the anomalous task subgroup and the task dimension label of the anomalous task in the anomalous task subgroup, and sending a first alarm notification to a first management device; the anomaly response processing corresponding to the second level includes: retrying the build of the anomalous task in the anomalous task subgroup, updating the display page of the target task indicator, and sending a second alarm notification to a first device of the person who submitted the anomalous task and to the first management device; and the anomaly response processing corresponding to the third level includes at least one of the following: stopping task submission on the code version branch corresponding to the anomalous task in the anomalous task subgroup, restarting the build node of the anomalous task in the continuous integration system, rolling back the code of the code version branch to a stable version, and sending a third alarm notification to a second device of the superior of the person who submitted the anomalous task and to a management device at a higher level than the first management device.
9. The method according to any one of claims 1-8, characterized in that the continuous integration tasks include a code compilation task and a functional testing task; the target task indicator includes a discrete first task indicator and a continuous second task indicator; the number of tasks in the task subgroups of the first task indicator differs from the number of tasks in the task subgroups of the second task indicator; the first task indicator includes at least one of the following: task build success rate, task build retry success rate, and task test pass rate; and the second task indicator includes at least one of the following: average build duration and average queuing duration.
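To make the two kinds of indicator in claim 9 concrete, the sketch below computes one discrete indicator (build success rate) and one continuous indicator (average build duration) for a single task subgroup. The function names and the choice of seconds as the duration unit are assumptions for illustration.

```python
def build_success_rate(build_results):
    """Discrete indicator: share of successful builds in one subgroup.

    build_results -- list of booleans, True for a successful build
    """
    if not build_results:
        raise ValueError("subgroup must contain at least one task")
    return sum(1 for ok in build_results if ok) / len(build_results)


def average_build_duration(durations_seconds):
    """Continuous indicator: mean build duration of one subgroup."""
    if not durations_seconds:
        raise ValueError("subgroup must contain at least one task")
    return sum(durations_seconds) / len(durations_seconds)
```

The two indicators need not share a subgroup size: a proportion generally needs more tasks per subgroup than a mean does before it becomes a stable estimate, which is one plausible reason for the different subgroup sizes the claim recites.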
10. An anomaly detection device, characterized by comprising: a task grouping module, configured to divide continuous integration tasks whose completion time falls within a first detection period into at least one task subgroup; an indicator value determination module, configured to determine, for any task subgroup of the at least one task subgroup, an indicator value of the task subgroup for a target task indicator based on task information of the continuous integration tasks in the task subgroup; an anomaly detection module, configured to perform anomaly detection on the indicator values of the task subgroups based on a control chart baseline value of the target task indicator within the first detection period and preset anomaly detection conditions, to obtain an anomaly detection result for the first detection period, wherein the anomaly detection result indicates whether an anomaly exists and, when an anomaly exists, its anomaly level; and an anomaly response module, configured to perform anomaly response processing corresponding to the anomaly level of the anomaly detection result when the anomaly detection result indicates that an anomaly exists.
11. An electronic device, characterized by comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores one or more computer programs executable by the at least one processor, and the one or more computer programs, when executed by the at least one processor, enable the at least one processor to perform the anomaly detection method according to any one of claims 1-9.
12. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the anomaly detection method according to any one of claims 1-9.