Method, system, and computer-readable storage medium for anomaly detection and automatic rollback after a system version update

Baseline data collection, real-time monitoring, and automatic rollback address the problems of insufficient monitoring and missing self-healing capability after system version updates, providing automated anomaly detection and recovery and improving stability and operation and maintenance efficiency.

CN122309239A · Pending · Publication Date: 2026-06-30 · 亓泽辰

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications (China)
Current Assignee / Owner
亓泽辰
Filing Date
2026-03-31
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing technologies lack a continuous real-time monitoring mechanism after system version updates, making it impossible to identify anomalies in a timely manner. They rely on manual observation, and rollback operations are complex and lack self-healing capabilities, resulting in low system stability and operational efficiency.

Method used

Collect and store baseline data of the target system before the update, continuously collect real-time data, detect anomalies through preset rules, and automatically roll back to the version before the update when an anomaly is detected.

Benefits of technology

It enables proactive monitoring, intelligent anomaly detection, and automated recovery after system version updates, improving system stability and operational efficiency while reducing manual intervention and fault response time.

Abstract

With the continuous evolution of various systems, version updates have become commonplace. Whether in software systems, hardware firmware, or combined hardware and software systems (such as industrial equipment, robots, and vehicle systems), version updates can introduce unexpected negative effects, such as performance degradation, reduced user experience, equipment failure, or even system crashes. Existing technologies suffer from deficiencies such as a lack of real-time monitoring, absence of anomaly detection mechanisms, reliance on manual rollback, rigid threshold settings, and lack of automatic recovery capabilities. Therefore, a method is needed that can monitor the system's operational status in real time after a version update, automatically detect abnormal indicators, and trigger automatic rollback to ensure system stability and security. This application provides a method, system, and computer-readable storage medium for anomaly detection and automatic rollback after a system version update, aiming to intelligently identify abnormal fluctuations by continuously monitoring operational indicators and to automatically roll back to a stable state when necessary.

Description

Technical Field

[0001] This application relates to the field of system monitoring and fault self-healing technology, and more specifically, to a method for anomaly detection and automatic rollback after a system version update, a system, and a computer-readable storage medium.

Background Technology

[0002] With the continuous evolution of various systems, version updates have become an inevitable part of the development of software, hardware, and integrated software and hardware systems. In a wide range of scenarios such as industrial equipment, robots, automotive systems, smart homes, and cloud service platforms, version updates can bring functional optimization and performance improvements, but they are often accompanied by unexpected negative effects, such as decreased system performance, reduced user experience, equipment malfunctions, or even overall system crashes. Existing technologies have significant shortcomings in handling these issues. First, after a system version update there is no continuous real-time monitoring of operational status, so abnormal changes in indicators such as response time, resource utilization, or user satisfaction cannot be detected in a timely manner; problems are often discovered only after users complain or the fault escalates. Second, the system cannot automatically identify abnormal fluctuations in operational indicators, including but not limited to deviations in performance indicators, physical indicators, behavioral indicators, or user feedback indicators, and instead relies on manual observation and reporting, which prolongs the response cycle and increases blind spots in operations and maintenance. Third, once an anomaly is detected, rollback depends on manual execution by maintenance personnel, resulting in delayed responses and complex operations; in unattended or high-risk scenarios, this can easily lead to operational errors or security risks. Fourth, anomaly detection thresholds are rigid: fixed thresholds do not adapt to historical system data, changing application scenarios, or environmental factors, which leads to false alarms during normal fluctuations and missed alarms during critical anomalies. Fifth, the system lacks self-healing capabilities and does not automatically trigger recovery after an anomaly is detected, so fault states may continue to affect service continuity, potentially causing user data loss, business interruptions, or even security incidents. These problems are particularly pronounced in distributed systems, critical infrastructure, and environments with high real-time requirements, severely restricting system stability and operational efficiency. Therefore, a solution is urgently needed that can proactively monitor the updated system status, intelligently identify anomalous indicators, and achieve automated recovery.

[0003] To address the aforementioned issues, existing technologies urgently need improvement.

Summary of the Invention

[0004] The purpose of this application is to provide a method, system, and computer-readable storage medium for anomaly detection and automatic rollback after system version update, which has the advantages of realizing active monitoring, intelligent anomaly detection and automated recovery, significantly improving system stability and operation and maintenance efficiency, and reducing manual intervention and fault response time.

[0005] This application provides a method for anomaly detection and automatic rollback after a system version update, comprising the following steps: collecting and storing baseline data representing various operational indicators of the target system before the update; after the target system is updated to a new version, continuously collecting real-time data of the operational indicators; comparing the real-time data with the baseline data and detecting anomalies using preset rules; and, when an anomaly is detected, deciding based on a preset decision-making strategy whether to perform an automatic rollback and, if an automatic rollback is performed, restoring the version before the update and recovering the relevant data from the backup.

[0006] Furthermore, this application also proposes that operational metrics include, but are not limited to, at least one of performance metrics, user feedback metrics, system interaction metrics, or business metrics.

[0007] Furthermore, this application also proposes that user feedback metrics include, but are not limited to, at least one of the following: negative feedback rate, user dissatisfaction rate, or interaction interruption rate.

[0008] Furthermore, this application also proposes that the preset rules include, but are not limited to, at least one of the following: deviation rules, trend rules, combination rules, or machine learning models.

[0009] Furthermore, this application also proposes that the deviation rules include: if a certain indicator meets a predetermined deviation condition compared with the baseline, it is judged as abnormal.

[0010] Furthermore, this application also proposes that the trend rule includes: if a certain indicator continues to deteriorate for multiple consecutive monitoring periods, it is judged as abnormal.
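
A trend rule of this kind could be sketched as follows. This is a minimal illustration, assuming one sample per monitoring period, that a larger value means a worse state (as with response time), and that three consecutive deteriorating periods are enough to flag an anomaly. The name `is_deteriorating` and the parameter `periods` are hypothetical and do not come from this application.

```python
from typing import Sequence


def is_deteriorating(samples: Sequence[float], periods: int = 3) -> bool:
    """Return True if the last `periods` steps each got strictly worse.

    Assumes one sample per monitoring period and that larger is worse.
    """
    if periods < 1 or len(samples) < periods + 1:
        return False  # not enough history to judge a trend
    recent = samples[-(periods + 1):]
    # Every sample must be strictly greater than the one before it.
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

For a metric where smaller is worse (for example, throughput), the comparison would be reversed.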

[0011] Furthermore, this application also proposes a threshold adaptive adjustment step: dynamically adjusting the threshold for anomaly detection based on historical operational data.
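
The application does not specify how the threshold is adjusted. One common possibility, shown here purely as an assumption, is to set an upper threshold at the historical mean plus `k` standard deviations. The function name `adaptive_threshold` and the default `k = 3.0` are illustrative choices.

```python
from statistics import mean, pstdev
from typing import Sequence


def adaptive_threshold(history: Sequence[float], k: float = 3.0) -> float:
    """Upper anomaly threshold derived from historical samples of one metric.

    Assumed formula (not from the application): mean + k * standard deviation.
    """
    if not history:
        raise ValueError("history must contain at least one sample")
    return mean(history) + k * pstdev(history)
```

Recomputing the threshold over a sliding window of recent samples would let it follow slow changes in the system's normal range instead of staying fixed.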

[0012] Furthermore, this application proposes that the decision-making strategy includes at least one of automatic rollback, semi-automatic rollback, or disabling rollback.

[0013] Furthermore, this application also proposes that restoring to the previous version includes invoking the version control module or firmware rollback mechanism; restoring from backup includes restoring from snapshots, images, or configuration backups.

[0014] Furthermore, this application also proposes that the target system includes a software system, a hardware system, a firmware system, or a combined software and hardware system.

[0015] Furthermore, this application also proposes an anomaly detection and automatic rollback system after a system version update, comprising: an indicator acquisition module, used to collect baseline and real-time data of operational indicators before and after the target system is updated; a monitoring module, used to continuously monitor the real-time data; an anomaly detection engine, used to detect whether anomalies exist based on preset rules; a decision-making module, used to determine whether to perform automatic rollback based on preset decision-making strategies; and a rollback execution module, used to restore the version before the update and restore relevant data from backup.

[0016] Furthermore, this application also proposes to include a threshold adaptive module for dynamically adjusting the threshold for anomaly detection based on historical operational data.

[0017] Furthermore, this application also proposes to include a user notification module for sending notifications to users upon anomaly detection or rollback decision.

[0018] Furthermore, this application proposes that the rollback execution module be configured to call the version control module to restore to the version before the update and restore relevant data from the snapshot.

[0019] Furthermore, this application also proposes a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the above-described method.

[0020] As can be seen from the above, the method, system, and computer-readable storage medium for anomaly detection and automatic rollback after system version update provided in this application solve the problems of lack of continuous monitoring, inability to automatically identify anomalies, and reliance on manual operation in the background technology by collecting baseline data, real-time monitoring, anomaly detection, and automatic rollback decision. It has the advantages of realizing active monitoring, intelligent anomaly detection, and automated recovery, significantly improving system stability and operation and maintenance efficiency, and reducing manual intervention and fault response time.

Attached Figure Description

[0021] Several embodiments of this application are described below with reference to the accompanying drawings. It should be noted that the specific structures, modules, steps, parameters, and connections shown in the drawings are preferred embodiments of this application and not limitations on the scope of protection of this application. Those skilled in the art can make various modifications, substitutions, or combinations to the specific details shown in the drawings based on the teachings of this application, and these modified embodiments should still be considered to fall within the scope of protection of this application.

[0022] Figure 1 illustrates the overall architecture of the anomaly detection and automatic rollback system provided in this application. The modules can be adjusted according to actual applications and do not limit the scope of protection.

[0023] Figure 2 is a flowchart showing an example of the anomaly detection process of this application. The specific rule types and judgment logic can be configured according to actual needs and do not constitute a limitation on the claims.

[0024] Figure 3 is a flowchart showing an example of the rollback decision-making and execution process of this application. The specific decision-making strategy can be configured according to actual needs and does not constitute a limitation on the claims.

Detailed Implementation

[0025] The technical solutions of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of them. The components of this application described and shown in the accompanying drawings can generally be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments of this application provided in the accompanying drawings is not intended to limit the scope of the claimed application, but merely represents selected embodiments of this application. All other embodiments obtained by those skilled in the art based on the embodiments of this application without inventive effort are within the scope of protection of this application. Other technologies that may be mentioned in the embodiments can be implemented using existing technology or other patent applications filed by the applicant on the same day, and will not be repeated here. It should be particularly noted that the specific module divisions, process steps, data flow directions, status names, time values, etc., shown in the accompanying drawings are merely illustrative examples and should not constitute a limitation on the scope of protection of the claims of this application. The scope of protection of the claims is determined solely by their wording and should be interpreted in accordance with the overall content of the specification.

[0026] It should be noted that similar reference numerals and letters in the following figures indicate similar items; therefore, once an item is defined in one figure, it does not need to be further defined and explained in subsequent figures. Furthermore, in the description of this application, terms such as "first," "second," etc., are used only to distinguish descriptions and should not be construed as indicating or implying relative importance.

[0027] During system version updates, existing technologies lack a continuous monitoring mechanism for operational status, resulting in the inability to promptly identify abnormal behaviors introduced by the new version. Specifically, no effective benchmark for comparison is established after the system update, fluctuations in operational metrics cannot be quantified and assessed, anomaly detection relies on human experience rather than automated rules, rollback operations require manual intervention and lack standardized procedures, and the lack of system self-healing capabilities allows fault states to persist. This issue directly impacts system stability metrics, manifesting as prolonged service interruption time, reduced fault recovery efficiency, and potential data integrity risks. Among these, the system's inability to autonomously identify performance degradation caused by version updates is the core defect.

[0028] For example, after a firmware update to an industrial robot control system, the monitoring system failed to capture the correlation between changes in motor temperature and joint vibration amplitude. When the new firmware version caused anomalies in the motion control algorithm, temperature parameters deviated from historical ranges and vibration amplitude showed an increasing trend. However, existing technologies only rely on fixed thresholds for independent indicator judgments, failing to incorporate the deterioration of multi-dimensional parameters into a comprehensive evaluation. Due to the lack of a baseline data storage mechanism before the update, the system cannot confirm the difference between the current state and the baseline before the update. Maintenance personnel can only learn about equipment malfunctions through user feedback, by which time the anomaly has already caused the actuator positioning deviation to exceed the process tolerance range. In this scenario, the mismatch between the monitoring cycle and the data acquisition frequency leads to the omission of critical fluctuation windows, the fixed threshold setting cannot adapt to parameter drift caused by equipment aging, and manual rollback operations require stopping the machine to check the configuration version and manually restore the backup, resulting in increased unplanned downtime on the production line.

[0029] If the above issues are not resolved, the system will face the risk of a continuous decline in service availability. The inability to promptly prevent abnormal states will cause the failure to spread to related subsystems. In critical infrastructure scenarios, undetected anomalies caused by version updates may trigger a chain reaction; for example, a single point of failure in an industrial control network can propagate to other device units via the bus. Simultaneously, delayed manual intervention leaves the system in a sub-optimal state for extended periods, and cumulative parameter deterioration may cause hardware damage or data inconsistency. This technical deficiency also hinders the construction of automated operation and maintenance systems, forcing maintenance teams to maintain high-frequency manual inspections. The system's inability to achieve self-healing capabilities will significantly increase operational complexity and reduce resource utilization efficiency.

[0030] To address this issue, this application proposes a method for anomaly detection and automatic rollback after a system version update, comprising the following steps: collecting and storing baseline data representing various operational indicators of the target system before the update; after the target system is updated to a new version, continuously collecting real-time data of the operational indicators; comparing the real-time data with the baseline data and detecting anomalies using preset rules; and, when an anomaly is detected, deciding based on a preset decision-making strategy whether to perform an automatic rollback and, if an automatic rollback is performed, restoring the version before the update and recovering the relevant data from the backup.
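
The way these steps fit together can be sketched as a small control loop. This is an illustrative outline only: the function `monitor_after_update` and its parameters are hypothetical, and the detection rule, decision strategy, and rollback action are passed in as callables because the application leaves them configurable.

```python
from typing import Callable, Dict, Iterable

Metrics = Dict[str, float]


def monitor_after_update(
    baseline: Metrics,
    samples: Iterable[Metrics],
    is_anomalous: Callable[[Metrics, Metrics], bool],
    should_roll_back: Callable[[Metrics], bool],
    roll_back: Callable[[], None],
) -> bool:
    """Compare each real-time sample with the baseline and react to anomalies.

    Returns True if a rollback was performed, False if the stream of
    samples ended without one.
    """
    for current in samples:
        if not is_anomalous(baseline, current):
            continue  # sample looks normal, keep monitoring
        if should_roll_back(current):
            roll_back()  # restore the previous version and data from backup
            return True
    return False
```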

[0031] For ease of understanding, the following explains some key terms in this embodiment: A target system refers to any system that requires version updates and monitoring of its operational status. This can include software systems, hardware systems, firmware systems, or combined software and hardware systems. For example, a cloud service platform, an industrial robot control unit, or an in-vehicle infotainment system can all serve as a target system.

[0032] Operational metrics refer to various data that reflect the operational status and performance of a target system. These metrics can cover multiple dimensions such as system performance, resource utilization, user experience, and device status, and are used to quantify and evaluate the health status of the system.

[0033] Baseline data refers to the stable state data of various operational indicators collected and stored before the target system is updated to a new version. This data serves as a reference standard for subsequent real-time data comparison and is used to identify abnormal fluctuations that may occur after the update.

[0034] Real-time data refers to the current data of various operational metrics continuously collected after the target system is updated to a new version. This data reflects the actual performance of the system when running the new version and serves as direct input for anomaly detection.

[0035] Preset rules refer to the logic or model used to determine whether there are anomalies between real-time data and baseline data. These rules define what degree of deviation or what pattern of change should be considered an anomaly.

[0036] An anomaly refers to a state in which the target system's operating metrics deviate from baseline data or exhibit unexpected trends after an update. The occurrence of anomalies may indicate a decline in system performance, functional failure, or potential risks.

[0037] A decision-making strategy refers to a predefined plan used to determine whether and how to perform an automatic rollback when an anomaly is detected. This strategy guides the system to take appropriate countermeasures based on the severity of the anomaly and business needs.

[0038] Automatic rollback refers to the operation where the system automatically restores the system to the stable version before the update without manual intervention when it detects an anomaly and determines that a rollback is necessary based on the decision-making strategy.

[0039] A backup is a stored copy used to restore relevant data and configurations during a system rollback. Backups ensure that the system can be restored to its pre-update usable state after a rollback.

[0040] This embodiment provides a method for anomaly detection and automatic rollback after a system version update, which is characterized by the following aspects: First, the method involves collecting and storing baseline data representing various operational metrics of the target system prior to the update. Before updating the target system, it is necessary to obtain key parameters under stable operating conditions as a reference. For example, the system can be manually configured to take a one-time snapshot of basic performance parameters such as CPU utilization, memory usage, and network latency at a specific time point, and save it to a text file or simple database on local storage. This data is considered the "fingerprint" of the system under normal operating conditions, providing a basis for comparison in subsequent anomaly detection.
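
Storing such a snapshot in a local file could look like the following sketch. It assumes the metrics have already been measured and are passed in as a dictionary; the JSON layout, the field name `captured_at`, and the functions `save_baseline` and `load_baseline` are illustrative and not part of this application.

```python
import json
import time
from pathlib import Path
from typing import Dict


def save_baseline(metrics: Dict[str, float], path: Path) -> None:
    """Write a one-time snapshot of metrics to a JSON file."""
    record = {"captured_at": time.time(), "metrics": metrics}
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")


def load_baseline(path: Path) -> Dict[str, float]:
    """Read back the metrics stored by save_baseline."""
    record = json.loads(path.read_text(encoding="utf-8"))
    return record["metrics"]
```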

[0041] Secondly, after the target system is updated to the new version, this method continuously collects real-time data on various operational metrics. After the system update, to promptly identify potential problems, continuous monitoring of the target system's operational status is necessary. Specifically, a scheduled task can be set up to retrieve various operational parameters of the current system at fixed intervals (e.g., every five minutes) through the system's built-in monitoring interface, such as the number of active processes, disk I/O rate, and service response time, and record this data. This continuous data collection ensures a comprehensive understanding of the system's operational status in the new version.
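
Fixed-interval collection can be sketched as a simple polling loop. The monitoring interface is represented here by a caller-supplied `read_metrics` function, which is a placeholder for whatever interface the target system actually exposes; `collect_samples` and its parameters are likewise hypothetical.

```python
import time
from typing import Callable, Dict, List

Metrics = Dict[str, float]


def collect_samples(
    read_metrics: Callable[[], Metrics],
    interval_seconds: float,
    count: int,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Metrics]:
    """Poll read_metrics `count` times, waiting interval_seconds between polls."""
    samples: List[Metrics] = []
    for i in range(count):
        samples.append(read_metrics())
        if i < count - 1:
            sleep(interval_seconds)  # no wait after the final sample
    return samples
```

With `interval_seconds=300` this corresponds to the five-minute interval in the example. The `sleep` parameter exists only so the wait can be replaced in tests.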

[0042] Furthermore, this method compares real-time data with baseline data and detects anomalies using preset rules. After real-time data is acquired, it is compared with the previously stored baseline data to identify any significant deviations. For example, a simple threshold rule can be set: if a real-time metric (such as CPU utilization) exceeds the corresponding baseline metric by more than a fixed percentage (e.g., 20%), it is considered a potential anomaly. This comparison mechanism provides an initial filter for situations that do not conform to normal operating conditions.
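
The threshold rule in this example can be written as a relative comparison. The sketch below reads the rule as "more than 20% above the baseline value" and assumes a larger value is worse; the function name `exceeds_baseline` and the parameter `tolerance` are illustrative.

```python
def exceeds_baseline(current: float, baseline: float, tolerance: float = 0.20) -> bool:
    """Return True if current is more than `tolerance` above baseline.

    tolerance=0.20 corresponds to the 20% example; assumes larger is worse.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive for a relative comparison")
    return (current - baseline) / baseline > tolerance
```

For instance, with a baseline CPU utilization of 40%, a reading of 50% is a 25% relative increase and would be flagged, while a reading of 46% (15%) would not.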

[0043] Furthermore, when an anomaly is detected, the method decides whether to perform an automatic rollback based on a pre-defined decision-making strategy. Once the system identifies an anomaly, it needs to determine the next action based on a pre-set handling plan. For example, a simple decision logic can be set so that the system immediately triggers a rollback operation whenever any anomaly is detected. Alternatively, it can be set so that when an anomaly is detected, the system generates a warning message and waits for manual confirmation before performing the rollback. This decision-making mechanism ensures that the response to anomalies is predictable.

[0044] Finally, if an automatic rollback is performed, this method restores the system to the version before the update and recovers relevant data from backups. When the decision strategy indicates that a rollback is necessary, the system will take measures to restore the target system to its stable state before the update. Specifically, this can be done by manually redeploying the pre-update system installation package, or by copying previously manually created copies of critical system configuration files and data files to overwrite the current files, thus returning the system to its pre-update configuration and data state. This minimizes the impact of anomalies on system stability and data integrity.
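
The file-copy variant of this recovery step could be automated along the following lines. This is a sketch under the assumption that the backup directory mirrors the layout of the live directory; `restore_from_backup` and both directory parameters are hypothetical names.

```python
import shutil
from pathlib import Path


def restore_from_backup(backup_dir: Path, live_dir: Path) -> int:
    """Copy every file under backup_dir over its counterpart in live_dir.

    Returns the number of files restored. Files that exist only in
    live_dir are left untouched.
    """
    restored = 0
    for source in backup_dir.rglob("*"):
        if not source.is_file():
            continue
        target = live_dir / source.relative_to(backup_dir)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)  # overwrite, preserving timestamps
        restored += 1
    return restored
```

Files introduced by the new version are not deleted by this sketch, so a real rollback may also need to remove them or redeploy the previous installation package.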

[0045] The following example will provide a more detailed explanation of the above technical solution: Suppose a cloud service provider plans to upgrade its core database system to introduce new query optimization features. Before the upgrade, to ensure the stability of the upgraded system, the provider decides to use the method described in this embodiment.

[0046] First, before updating the database system to a new version, the system collects and stores baseline data representing various operational metrics prior to the update. Specifically, during a period of stable operation, the system records key performance indicators such as average query response time, transactions per second (TPS), CPU utilization, memory usage, and disk I/O. This data is stored as a baseline configuration file, serving as a performance reference for the database system under normal and healthy conditions.

[0047] Subsequently, once the database system is successfully updated to the new version and goes live, the system will continuously collect real-time data on various operational metrics. For example, every minute, the system will automatically retrieve data such as current query response time, TPS, and CPU utilization from the database monitoring interface. This real-time data is continuously collected to reflect the actual operational status of the new version of the database.

[0048] Next, the system compares the real-time data with the baseline data, detecting any anomalies using preset rules. Specifically, the system might preset a simple rule: if the real-time query response time exceeds the baseline query response time by more than a fixed percentage (e.g., 50%) for five consecutive minutes, it is considered an anomaly. As real-time data is collected, this rule is applied immediately to determine whether current database performance has declined significantly.
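
With one sample per minute, as in this example, the rule amounts to checking that the five most recent samples all lie more than 50% above the baseline. The sketch below makes that reading explicit; `sustained_breach`, `tolerance`, and `window` are illustrative names.

```python
from typing import Sequence


def sustained_breach(
    samples: Sequence[float],
    baseline: float,
    tolerance: float = 0.50,
    window: int = 5,
) -> bool:
    """Return True if the last `window` samples all exceed baseline * (1 + tolerance)."""
    if window < 1 or len(samples) < window:
        return False  # not enough samples to cover the window
    limit = baseline * (1.0 + tolerance)
    return all(value > limit for value in samples[-window:])
```

Requiring the whole window to breach the limit keeps a single slow query from triggering a rollback.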

[0049] Furthermore, when an anomaly is detected, the system will decide whether to perform an automatic rollback based on a preset decision-making strategy. For example, the provider's preset decision-making strategy might be: if the detected anomaly belongs to the "high severity" category (such as a severe deterioration in query response time), an automatic rollback will be performed immediately; if it belongs to the "medium severity" category, a notification will be sent to the operations and maintenance personnel for manual confirmation before the rollback is performed. In this example, due to the severe deterioration in query response time, the system determines that an automatic rollback is necessary based on the strategy.
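
A severity-based strategy like this can be expressed as a lookup table. In the sketch below, the "high" and "medium" entries follow the example; the "low" entry and all identifiers (`Severity`, `Action`, `POLICY`, `decide`) are assumptions added for illustration.

```python
from enum import Enum


class Severity(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


class Action(Enum):
    AUTO_ROLLBACK = "auto_rollback"
    AWAIT_CONFIRMATION = "await_confirmation"
    NO_ROLLBACK = "no_rollback"


# HIGH and MEDIUM follow the example; the LOW entry is an assumption.
POLICY = {
    Severity.HIGH: Action.AUTO_ROLLBACK,
    Severity.MEDIUM: Action.AWAIT_CONFIRMATION,
    Severity.LOW: Action.NO_ROLLBACK,
}


def decide(severity: Severity) -> Action:
    """Map an anomaly's severity to the configured response."""
    return POLICY[severity]
```

Keeping the policy in a table rather than in branching code makes it straightforward to reconfigure per deployment.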

[0050] Finally, if the system decides to perform an automatic rollback, it will revert to the previous version and restore relevant data from the backup. Specifically, the system will stop the current new version of the database service and then redeploy the previous version. Simultaneously, the system will restore the database structure and data from a pre-created logical backup (e.g., a compressed SQL file or data directory), ensuring the rolled-back database can provide services normally. Through these steps, the database system can quickly recover from the abnormal state introduced by the new version and return to a known stable operating state.

[0051] Based on the above example of a database system upgrade, the technical concept of this embodiment makes a clear technical contribution. Conventional version update methods often lack real-time, automated monitoring of the system's operational status after the update. For example, in the database upgrade scenario described above, without the method of this embodiment, operations and maintenance personnel might need to rely on manual observation or user feedback to discover the deterioration in query response time, which typically delays problem detection and affects user experience.

[0052] This embodiment constructs a comprehensive, real-time monitoring system by collecting and storing baseline data representing various operational indicators before the target system is updated, and by continuously collecting real-time data of various operational indicators after the target system is updated to the new version. Compared with the existing technology that relies on manual inspection or passively waiting for user reports, this embodiment can proactively and continuously acquire the system's operating status, providing a timely and accurate data foundation for subsequent anomaly detection.

[0053] Furthermore, this embodiment achieves automated anomaly identification by comparing real-time data with baseline data and detecting anomalies using preset rules. In the example above, the system can automatically determine whether database performance is abnormal based on preset rules (such as query response time exceeding the baseline by 50%), without requiring manual data comparison or log analysis. This significantly improves the efficiency and accuracy of anomaly detection, avoiding the lack or rigidity of anomaly detection mechanisms in existing technologies.

[0054] More importantly, this embodiment endows the system with automated fault self-healing capabilities by determining whether to perform automatic rollback based on a preset decision strategy when an anomaly is detected, and if so, restoring to the version before the update and recovering relevant data from backups. In scenarios where database performance is severely degraded, the system can automatically trigger a rollback according to the preset strategy, restoring the database to the stable version before the update and recovering data from backups. This contrasts sharply with existing technologies where rollback operations are highly dependent on manual intervention, have delayed responses, and are complex. The solution in this embodiment can minimize the impact of anomalies on service availability and data integrity, ensuring the stability and reliability of the system after version updates.

[0055] In summary, this embodiment effectively solves the problems of insufficient monitoring, missing anomaly detection, reliance on manual rollback, and lack of automatic recovery capabilities after system version updates in the prior art by establishing a baseline, real-time monitoring, automated anomaly detection, intelligent decision rollback, and data recovery. It provides an efficient and reliable solution for the stable operation of the system after version updates.

[0056] The solutions described above use operational metrics to collect system operation data for anomaly detection. If the definition of those metrics is unclear or incomplete, however, they may fail to cover aspects such as performance, user feedback, system interaction, and business, resulting in inaccurate anomaly detection or poor adaptability.

[0057] In this regard, this application further proposes that the above-mentioned operational indicators include, but are not limited to, at least one of performance indicators, user feedback indicators, system interaction indicators, or business indicators.

[0058] Operational metrics are quantified data points used to evaluate the operational status and performance of a target system. These metrics form the basis for anomaly detection; continuously collecting and analyzing them makes it possible to determine whether system anomalies have occurred after an update. Operational metrics can be collected in various ways, such as by deploying lightweight agents within the target system to automatically collect internal system data, or by using external monitoring tools and sensors to observe and record system behavior.

[0059] Performance metrics are key data points for measuring system efficiency, response speed, and resource utilization. They reflect the system's efficiency and resource consumption when processing tasks, and may include, for example, CPU utilization, memory usage, disk I / O throughput, network bandwidth utilization, API response time, error rate, and system throughput. By monitoring these metrics, problems such as system performance degradation, resource bottlenecks, or insufficient processing capacity can be detected in a timely manner.

[0060] User feedback metrics directly reflect users' perceptions of their system experience and satisfaction. These metrics are crucial for assessing whether system updates negatively impact user experience. Examples include user complaint rates, changes in app store ratings, user retention rates, session duration, perceived page load speed, and the number of negative feedback items submitted by users.

[0061] System interaction metrics are used to measure the health of interactions between internal system components or between the system and the external environment. They help identify communication failures within internal modules, abnormal external interface calls, or problems arising when collaborating with other systems. Examples include message queue backlog, inter-service call success rate, data transmission latency, number of external API calls, communication bus load, and packet loss rate.

[0062] Business metrics directly reflect the operational status of the core business functions supported by the system and the achievement of business objectives. These metrics are crucial to ensuring that system updates do not negatively impact core business processes. Examples include order success rate, payment success rate, transaction volume, user conversion rate, inventory accuracy, production line output, equipment utilization rate, and the completion time of key business processes.

[0063] This application's solution specifies the aforementioned operational metrics as at least one of performance metrics, user feedback metrics, system interaction metrics, or business metrics, enabling a more comprehensive and multi-dimensional coverage of all aspects of system operation during anomaly detection after system version updates. Before the target system is updated to a new version, the system collects and stores baseline data representing various operational metrics prior to the update. This baseline data provides a reference standard for subsequent anomaly detection. After the target system is updated to a new version, the system continuously collects real-time data of the aforementioned specified operational metrics. This real-time data includes information from multiple dimensions such as system performance, user experience, internal interactions, and business functions. Subsequently, the system compares this real-time data with the pre-stored baseline data and detects the presence of anomalies according to preset rules. For example, if performance metrics (such as API response time) deteriorate significantly, or user feedback metrics (such as negative feedback rate) rise sharply, the system can promptly identify potential problems. In this way, the solution of this application solves the problem of inaccurate anomaly detection caused by unclear or incomplete definitions of operational metrics in traditional methods, ensuring the comprehensiveness and accuracy of anomaly detection, thereby providing a reliable basis for subsequent automatic rollback decisions.
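The baseline step described above can be sketched as follows. This is a minimal illustration, not the implementation of this application: the names `MetricCategory`, `MetricSample`, `build_baseline`, and `relative_change` are hypothetical, and using the mean of the pre-update samples as the baseline is an assumption (a median or percentile would fit the description equally well).

```python
from dataclasses import dataclass
from enum import Enum
from statistics import mean


class MetricCategory(Enum):
    """The four metric categories named in the description."""
    PERFORMANCE = "performance"
    USER_FEEDBACK = "user_feedback"
    SYSTEM_INTERACTION = "system_interaction"
    BUSINESS = "business"


@dataclass(frozen=True)
class MetricSample:
    name: str                 # hypothetical metric identifier
    category: MetricCategory
    value: float


def build_baseline(samples):
    """Return {metric name: mean of its pre-update samples}."""
    grouped = {}
    for s in samples:
        grouped.setdefault(s.name, []).append(s.value)
    return {name: mean(values) for name, values in grouped.items()}


def relative_change(baseline, name, live_value):
    """Signed relative change of a live value versus its stored baseline."""
    base = baseline[name]
    if base == 0:
        raise ValueError(f"baseline for {name!r} is zero")
    return (live_value - base) / base


# Illustrative pre-update samples (made-up values).
pre_update = [
    MetricSample("api_response_ms", MetricCategory.PERFORMANCE, 48.0),
    MetricSample("api_response_ms", MetricCategory.PERFORMANCE, 52.0),
    MetricSample("order_success_rate", MetricCategory.BUSINESS, 0.98),
]
baseline = build_baseline(pre_update)
```

A preset rule would then compare `relative_change(...)` for each live sample against its configured condition.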

[0064] As a specific implementation, the solution described in this application can be used for anomaly detection after updating the software version of an online e-commerce platform. Before the update, the system collects and stores baseline data for metrics such as average order processing time (performance metric), user shopping cart abandonment rate (business metric), customer service complaint volume (user feedback metric), and payment gateway call success rate (system interaction metric). After the update, the system continuously collects these metrics in real time. For example, performance metrics such as CPU utilization, memory usage, and database query response time are obtained in real time through a monitoring agent deployed on the server; user feedback metrics, such as the number of negative comments submitted by users after the update and the user session interruption rate, are obtained by analyzing user behavior logs and customer service system data; system interaction metrics, such as the call success rate and latency between microservices such as the payment gateway and the inventory service, are obtained by monitoring inter-service communication logs; and business metrics, such as order success rate and product browsing conversion rate, are obtained through business database statistics. When the system detects that the order success rate (business metric) is 5% lower than the baseline and the payment gateway call success rate (system interaction metric) is declining, or that the user shopping cart abandonment rate (business metric) is significantly higher than the baseline, the system determines that an anomaly exists and decides whether to perform automatic rollback according to the preset decision-making strategy.
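The anomaly condition in this e-commerce example could be expressed roughly as below. The sketch makes several assumptions that the text leaves open: "5% lower" is read as a relative drop, "declining" as each recent reading being lower than the one before, and "significantly higher" as a relative rise of 20% (the `abandon_rise` default is purely hypothetical). All function and key names are illustrative.

```python
def is_declining(recent):
    """True if each value is lower than the one before it (needs >= 2 values)."""
    return len(recent) >= 2 and all(b < a for a, b in zip(recent, recent[1:]))


def ecommerce_anomaly(baseline, live, gateway_history,
                      order_drop=0.05, abandon_rise=0.20):
    """(order success dropped AND gateway success declining) OR abandonment high."""
    order_base = baseline["order_success_rate"]
    order_dropped = live["order_success_rate"] <= order_base * (1 - order_drop)

    abandon_base = baseline["cart_abandonment_rate"]
    abandon_high = live["cart_abandonment_rate"] >= abandon_base * (1 + abandon_rise)

    return (order_dropped and is_declining(gateway_history)) or abandon_high


# Illustrative values only.
baseline = {"order_success_rate": 0.98, "cart_abandonment_rate": 0.60}
live = {"order_success_rate": 0.92, "cart_abandonment_rate": 0.61}
anomaly = ecommerce_anomaly(baseline, live, gateway_history=[0.999, 0.99, 0.97])
```

Requiring the order-success drop and the gateway decline together is what makes this a combined condition; either signal alone does not fire the first branch.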

[0065] Through the above technical solution, this application ensures more comprehensive and accurate monitoring and anomaly detection of the system's operational status after system version updates. By clarifying and expanding the types of operational indicators, it covers multiple key dimensions such as performance, user feedback, system interaction, and business operations, effectively avoiding missed or false alarms caused by unclear or incomplete indicator definitions. This enables the system to detect potential problems introduced by updates earlier and more accurately, thus providing a solid data foundation for timely implementation of measures such as automatic rollback, significantly improving the accuracy of anomaly detection and the system's adaptability to update risks.

[0066] In some of the solutions described above in this application, user feedback metrics are proposed to detect system anomalies. However, the specific content of these user feedback metrics is not clearly defined, which can lead to inaccurate detection or failure to cover key user-related indicators, such as user dissatisfaction or interaction problems, thus affecting the comprehensiveness and accuracy of anomaly detection. Therefore, this application further proposes that the user feedback metrics include, but are not limited to, at least one of the following: negative feedback rate, user dissatisfaction rate, or interaction interruption rate.

[0067] Negative feedback rate refers to the share of user feedback expressing dissatisfaction with a system or service within a specific time period, measured against either all user feedback or total user interactions. Its purpose is to directly quantify the intensity of negative sentiment among users after a system update. It can be obtained by counting user-submitted negative reviews, complaints, and low ratings; or through sentiment analysis of text content such as user comments and social media mentions to identify and count negative sentiment. User dissatisfaction rate refers to the proportion of users identified as dissatisfied through analysis of user behavior data or text data. Its purpose is to uncover latent user dissatisfaction, providing early warnings through changes in sentiment even without explicit negative feedback. It can be obtained by using natural language processing (NLP) technology to analyze the sentiment of user text in forums, social media, and customer service conversations and identify negative expressions; or by combining user behavior patterns within the application (such as frequent exits, prolonged pauses, and repeated failed operations) with machine learning models to infer user dissatisfaction. Interaction interruption rate refers to the proportion of interaction attempts in which the user's normal interaction process is forcibly interrupted by system failures, delayed responses, error messages, and the like. Its purpose is to measure the smoothness and stability of system interaction, directly reflecting the degree of impairment to the user experience. 
This can be achieved by recording the success and failure status of user operations in system logs and counting the number of failures caused by system issues; or by using front-end monitoring (RUM) technology to monitor user actions such as clicks, navigation, and form submissions on pages or applications in real time and record abnormal interruption events.

[0068] This application's solution refines user feedback metrics into specific metrics such as negative feedback rate, user dissatisfaction rate, and interaction interruption rate, enabling the anomaly detection engine to more comprehensively and accurately capture potential user experience issues after system updates. When the target system updates to a new version, the system continuously collects real-time data on various operational metrics, including these refined user feedback metrics. The negative feedback rate directly reflects users' subjective evaluation of the new version; an abnormally high rate indicates that the new version may have introduced features or defects that users find unsatisfactory. The user dissatisfaction rate, through deeper sentiment analysis, captures potential negative emotions from users, providing early warnings even without direct feedback, which is crucial for discovering hidden problems. The interaction interruption rate quantifies obstacles encountered by users during system interaction from the perspective of system usability and smoothness, such as operation failures and page lag; these interruptions are often the direct cause of user dissatisfaction. By monitoring these specific metrics in real time and comparing them with baseline data before the update, the anomaly detection engine can accurately determine whether there are user experience-related anomalies based on preset rules, such as deviation rules or trend rules. This detailed definition of indicators and monitoring mechanism enables the system to comprehensively evaluate user feedback from multiple dimensions, avoiding the underreporting or false alarms that may be caused by traditional general indicators, thus providing a more reliable basis for subsequent automatic rollback decisions.

[0069] The following example illustrates this. After a mobile application is updated, the system continuously monitors its user feedback metrics. These metrics include: negative feedback rate, calculated from the proportion of one-star and two-star reviews in app store user comments and the number of "problem feedback" or "complaint" items submitted by users within the application; user dissatisfaction rate, calculated by applying natural language processing to text content from in-app community forums or customer service chat logs to identify negative emotional words and expressions, and then computing the proportion of users with negative emotions among active users; and interaction interruption rate, calculated by recording the number of times key operations (such as payment, order submission, or file upload) failed or were interrupted due to application crashes, network errors, or server unresponsiveness, and computing its proportion of total operation attempts. These detailed user feedback metrics are continuously collected in real time and compared with baseline data from before the update. If any metric meets preset abnormal conditions (e.g., the negative feedback rate increases by more than 20%, or the interaction interruption rate exceeds 5% for two consecutive hours), the system triggers an alarm and decides whether to perform an automatic rollback based on its decision-making strategy.
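A rough sketch of two of these calculations and of the example alarm condition is given below. It assumes that the 20% increase is relative to the baseline rate and that interruption rates are aggregated per hour; the function names and parameters are hypothetical, and the NLP-based user dissatisfaction rate is omitted because it depends on a sentiment model that the text does not specify.

```python
def negative_feedback_rate(star_ratings):
    """Share of ratings that are one or two stars (0.0 if there are none)."""
    if not star_ratings:
        return 0.0
    return sum(1 for r in star_ratings if r <= 2) / len(star_ratings)


def interruption_rate(interrupted, attempts):
    """Interrupted key operations as a share of all attempts."""
    return interrupted / attempts if attempts else 0.0


def feedback_anomaly(baseline_nfr, live_nfr, hourly_interruption_rates,
                     nfr_increase=0.20, interruption_limit=0.05, hours=2):
    """Alarm if the negative feedback rate rose by more than `nfr_increase`
    (relative), or the interruption rate stayed above `interruption_limit`
    for the last `hours` hourly readings."""
    nfr_alarm = (baseline_nfr > 0
                 and (live_nfr - baseline_nfr) / baseline_nfr > nfr_increase)
    recent = hourly_interruption_rates[-hours:]
    interruption_alarm = (len(recent) == hours
                          and all(r > interruption_limit for r in recent))
    return nfr_alarm or interruption_alarm


# Illustrative values only.
live_nfr = negative_feedback_rate([1, 2, 5, 5, 4])
alarm = feedback_anomaly(0.10, 0.11, [0.02, 0.06, 0.07])
```

Checking the last two hourly readings, instead of a single reading, mirrors the "two consecutive hours" condition and keeps a brief spike from triggering the alarm.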

[0070] By refining user feedback metrics into specific metrics such as negative feedback rate, user dissatisfaction rate, and interaction interruption rate, the system can more accurately and comprehensively capture user experience issues during anomaly detection after version updates. This refinement avoids the blind spots that may arise from traditional general metrics. For example, the negative feedback rate directly quantifies explicit user dissatisfaction, the user dissatisfaction rate delves into potential emotional changes, and the interaction interruption rate directly reflects the impact of system availability and smoothness on user interaction. Therefore, this solution effectively improves the accuracy and timeliness of anomaly detection, ensuring rapid identification and rollback measures when user experience is compromised, thereby minimizing the negative impacts introduced by new versions and guaranteeing user satisfaction and system stability.

[0071] In some of the solutions described above in this application, preset rules are proposed to detect anomalies. However, in this process, the rules may be too simplistic and fail to fully cover different types of anomaly patterns, such as sudden deviations, continuous deterioration, or changes in a combination of multiple indicators, thereby affecting the accuracy and adaptability of the detection and increasing the risk of false alarms or missed alarms.

[0072] In this regard, this application further proposes that the preset rules include, but are not limited to, at least one of deviation rules, trend rules, combination rules, or machine learning models.

[0073] Preset rules refer to a series of logical judgment conditions or models used to determine whether the target system's operating status is abnormal after a system version update. Their purpose is to provide clear judgment criteria for anomaly detection, ensuring that the system can identify potential problems in a timely and accurate manner. These rules can be customized according to different system characteristics and business needs to adapt to diverse anomaly scenarios. Deviation rules are a method for anomaly judgment based on the difference between real-time data and baseline data. Their purpose is to quickly capture sudden and significant changes in system operating indicators. For example, an anomaly can be determined when a real-time indicator deviates from its baseline value by more than a preset percentage (e.g., 25%); or an anomaly alarm can be triggered when a real-time indicator exceeds a specific confidence interval (e.g., 99% confidence interval) of the historical statistical distribution of the baseline indicator. Trend rules focus on the dynamic trends of system operating indicators over time, aiming to identify anomalies that are not instantaneous but gradually accumulate and continuously deteriorate. Their purpose is to capture progressive failures or performance degradation. For example, an anomaly can be defined when a certain indicator continuously changes in an unfavorable direction over multiple consecutive monitoring periods (e.g., five consecutive periods) (e.g., a continuous increase in response time or a continuous rise in error rate); or, by calculating the slope of the indicator's change within a sliding time window, if the slope exceeds a preset threshold, an abnormal trend is considered to exist. Combined rules determine whether the system is abnormal by comprehensively analyzing multiple related indicators. 
Their purpose is to handle complex anomaly scenarios that a single indicator cannot fully reflect, improving the accuracy and robustness of detection. For example, an anomaly can be defined only when both "CPU utilization" and "memory usage" simultaneously exceed their respective preset thresholds; or, weights can be assigned to different anomaly indicators, and when the weighted total score of these indicators exceeds a certain comprehensive threshold, the entire system is considered abnormal. Machine learning models are a method that automatically identifies anomalies by learning patterns from historical operational data. Their purpose is to provide more intelligent and adaptive anomaly detection capabilities, especially suitable for identifying unknown or complex anomaly patterns, and can self-adjust according to dynamic changes in system behavior. For example, supervised learning methods can be used to train a classifier (such as a support vector machine or neural network) using historical normal and abnormal data, enabling it to classify and judge new real-time data; or, unsupervised learning methods (such as clustering algorithms or outlier detection algorithms) can be used to identify outliers that deviate from the normal data distribution without explicit labels.
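The two combined-rule variants mentioned above (all indicators over their thresholds, and a weighted total score) might look like the following sketch. The function names, weights, and thresholds are assumptions chosen for illustration only.

```python
def all_exceed(live, thresholds):
    """AND-style rule: every listed indicator is above its own threshold."""
    return all(live[name] > limit for name, limit in thresholds.items())


def weighted_anomaly_score(flags, weights):
    """Sum the weights of the indicators currently flagged as anomalous."""
    return sum(weights[name] for name, flagged in flags.items() if flagged)


def weighted_rule(flags, weights, threshold):
    """Score-style rule: the weighted total exceeds a comprehensive threshold."""
    return weighted_anomaly_score(flags, weights) > threshold


# Illustrative configuration (hypothetical values).
thresholds = {"cpu_utilization": 0.90, "memory_usage": 0.85}
weights = {"cpu_utilization": 3, "memory_usage": 2, "error_rate": 5}
flags = {"cpu_utilization": True, "memory_usage": False, "error_rate": True}

both_high = all_exceed({"cpu_utilization": 0.95, "memory_usage": 0.70}, thresholds)
system_abnormal = weighted_rule(flags, weights, threshold=6)
```

The weighted form lets one heavily weighted indicator, such as error rate here, outweigh several lightly weighted ones, which the strict AND form cannot express.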

[0074] This application's solution addresses the limitations of traditional single-rule methods in detecting anomalies after system version updates by introducing diverse preset rules. After the target system is updated to a new version, the system continuously collects real-time data on various operational metrics and compares it with baseline data collected and stored before the update. This comparison process no longer relies solely on a single judgment logic but flexibly utilizes at least one of deviation rules, trend rules, combined rules, or machine learning models. Deviation rules can quickly capture sudden deviations in metrics, such as a sharp drop in performance metrics or a rapid increase in error rate, thus promptly identifying transient faults. Trend rules focus on identifying gradual deterioration of metrics, such as a slow increase in response time or a continuous rise in resource consumption, which is crucial for discovering long-term accumulated performance bottlenecks or potential faults. Combined rules, by comprehensively considering multiple related metrics, can effectively address complex scenarios where a single metric is not significant, but multiple metrics exhibit abnormalities in synergy, avoiding false positives or false negatives. Furthermore, the introduction of machine learning models endows anomaly detection with adaptive and intelligent learning capabilities. This allows for the automatic identification of complex, even unpredictable, anomaly patterns from historical data, and the dynamic adjustment of detection logic based on changes in the system's operating environment. This further enhances detection accuracy and the ability to identify unknown anomalies. 
Through this multi-dimensional and intelligent rule system, the proposed solution can more comprehensively and accurately identify various anomalies that may occur after system version updates, thus providing a reliable basis for subsequent automatic rollback decisions and ensuring the stable operation of the system after the update.
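As a simple stand-in for the unsupervised outlier detection mentioned for machine learning models, the sketch below learns a "normal" profile from historical values and flags readings that fall far outside it. It uses a modified z-score based on the median and the median absolute deviation (MAD), which is a basic statistical technique and not the model of this application; the names and the 3.5 cutoff are assumptions.

```python
from statistics import median


def fit_normal_profile(history):
    """Learn (median, MAD) from historical values of one indicator."""
    med = median(history)
    mad = median([abs(x - med) for x in history])
    return med, mad


def is_outlier(value, profile, cutoff=3.5):
    """Flag a value whose modified z-score magnitude exceeds `cutoff`."""
    med, mad = profile
    if mad == 0:
        # History had no spread: anything different from the median stands out.
        return value != med
    return abs(0.6745 * (value - med) / mad) > cutoff


# Illustrative history of API response times in milliseconds.
profile = fit_normal_profile([50, 51, 49, 52, 50, 48, 51])
normal_reading = is_outlier(51, profile)
abnormal_reading = is_outlier(80, profile)
```

Re-fitting the profile periodically on recent healthy data is one way to approximate the dynamic adjustment described above; a trained classifier or clustering model could replace these two functions behind the same interface.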

[0075] The following is a concrete example. In a scenario involving a version update of an industrial robot control system, after the firmware is updated, the system continuously monitors operational indicators such as motor temperature, joint vibration amplitude, motion trajectory deviation, and energy consumption. To comprehensively and accurately detect anomalies, the system can be configured with various preset rules. For example, a deviation rule can be set: if the real-time motor temperature data increases by more than 20% compared to the baseline data, it is considered an anomaly. Simultaneously, a trend rule can be configured: if the joint vibration amplitude continues to deteriorate for three consecutive monitoring cycles (e.g., the vibration amplitude value continues to increase), it is considered an anomaly. Furthermore, a combined rule can be used: when the joint vibration amplitude deteriorates for three consecutive cycles and the motor temperature exceeds the baseline by 20%, the system comprehensively judges it as a serious anomaly. In addition, a machine learning model can be trained. This model learns normal operating patterns based on historical data of motor temperature, vibration amplitude, energy consumption, and other indicators. Once the real-time data deviates from this pattern, the model outputs an anomaly signal. Through the synergistic effect of these rules, the system can effectively identify complex anomalies that are difficult to detect with a single rule. For example, when a robot experiences increased friction in a joint due to a firmware update, it may initially only manifest as a slight increase in vibration amplitude. However, if the condition continues to deteriorate and is accompanied by an increase in motor temperature, the combined rules or trend rules can promptly detect this potential fault, thereby triggering the subsequent automatic rollback process and preventing equipment damage.
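The robot example's deviation, trend, and combined rules can be sketched together as follows. The sketch assumes that "three consecutive monitoring cycles" means three successive increases (so four readings are needed) and that the 20% figure is relative to the baseline; function names and severity labels are hypothetical.

```python
def exceeds_baseline(live, baseline, fraction=0.20):
    """Deviation rule: live value is more than `fraction` above the baseline."""
    return live > baseline * (1 + fraction)


def worsening_for(values, cycles=3):
    """Trend rule: the last `cycles` readings each rose above the previous one."""
    tail = values[-(cycles + 1):]
    return (len(tail) == cycles + 1
            and all(b > a for a, b in zip(tail, tail[1:])))


def classify(motor_temp, temp_baseline, vibration_history):
    """Combined rule: both signals together are treated as a serious anomaly."""
    hot = exceeds_baseline(motor_temp, temp_baseline)
    shaking = worsening_for(vibration_history)
    if hot and shaking:
        return "serious"
    if hot or shaking:
        return "anomaly"
    return "normal"


# Illustrative readings: temperature in degrees Celsius, vibration in arbitrary units.
state = classify(50.0, 40.0, [1.0, 1.1, 1.2, 1.3])
```

In the friction scenario described above, the trend rule alone would first report "anomaly" as the vibration climbs, and the result would escalate to "serious" once the motor temperature also passes the deviation threshold.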

[0076] By employing the aforementioned technical solutions, this application effectively addresses the problem that overly simplistic preset rules fail to comprehensively cover different types of anomaly patterns, thus affecting detection accuracy and adaptability and increasing the risk of false positives or false negatives. Specifically, by introducing deviation rules, sudden and significant changes in system operating indicators can be captured in a timely manner, effectively identifying transient anomalies; by introducing trend rules, gradual deterioration of indicators can be accurately identified, effectively discovering long-term accumulated performance bottlenecks or potential faults; by introducing combination rules, multiple related indicators can be comprehensively analyzed, effectively addressing complex anomaly scenarios that a single indicator cannot fully reflect, significantly improving detection accuracy and robustness; by introducing machine learning models, anomaly detection acquires adaptive and intelligent learning capabilities, automatically identifying complex, even unpredictable, anomaly patterns from historical data, and dynamically adjusting detection logic according to changes in the system operating environment. Therefore, the solution of this application provides a more comprehensive, accurate, and adaptable anomaly detection mechanism, significantly reducing the risk of false positives and false negatives, thereby ensuring the stability and reliability of the system after version updates.

[0077] In some of the solutions described above in this application, preset rules, including deviation rules, are proposed to detect anomalies. However, in the implementation process, deviation rules may not be able to adapt to dynamic changes in the system due to fixed thresholds, resulting in inaccurate detection, such as misjudging normal fluctuations as anomalies or missing real problems.

[0078] In response, this application further proposes deviation rules including: if a certain indicator meets a predetermined deviation condition compared with the baseline, it is judged as abnormal.

[0079] Among them, "deviation rules" are a type of rule used for anomaly detection. Their core idea is to determine the existence of anomalies by comparing the difference between current real-time data and preset baseline data. Their function is to provide a quantitative and objective basis for anomaly judgment. For example, this rule can be based on statistical principles to calculate the standard deviation and mean deviation between real-time data and baseline data; it can also be based on business experience, with fixed or dynamically adjusted thresholds preset by an expert system. "A certain indicator" refers to various performance, behavioral, or state parameters of the target system that can be quantified and monitored during operation. Its function is to serve as input data for anomaly detection, reflecting the system's operating status. For example, it can be a performance indicator reflecting system resource consumption, such as CPU utilization and memory usage; it can also be a user feedback indicator reflecting user experience, such as user satisfaction rating and interaction response time; or it can be a physical indicator reflecting the physical state of equipment, such as equipment temperature and vibration frequency. "Baseline" refers to the reference value or reference range of various indicators of the target system under normal and stable operating conditions. Its function is to serve as a comparison benchmark for judging whether real-time data is abnormal. For example, the baseline can be statistical data such as the average, median, or percentile over a period of time before the system update (e.g., the past 24 hours or 7 days); it can also be the instantaneous value before the system update, or an ideal value set by expert experience. "Predetermined deviation conditions" refer to the degree of difference or change pattern required for a real-time indicator to be judged as abnormal compared to the baseline. 
Its purpose is to clarify the quantitative standards for anomaly judgment and avoid subjective judgment. For example, it can be set as the percentage change between the real-time value and the baseline value, such as the real-time value exceeding the baseline value by 15%; it can also be set as the absolute difference between the real-time value and the baseline value, such as the real-time value being 10 units higher than the baseline value; or it can be based on statistical principles, such as the real-time value exceeding the baseline mean by two standard deviations. "Judging as abnormal" means that, based on the predetermined deviation conditions, the current system operating state deviates from the normal baseline and requires further attention or handling. Its purpose is to trigger subsequent anomaly handling procedures, such as alarms, logging, or automatic rollback. For example, an abnormal event or alarm signal can be generated internally by the system; it can also be by updating the abnormal status flag and recording the abnormal information in the log.

[0080] This application's solution involves collecting and storing baseline data representing various operational metrics before the target system is updated, and continuously collecting real-time data for these metrics after the target system is updated to the new version. The real-time data is then compared with the baseline data. A deviation rule, as a specific implementation of preset rules, is used to detect anomalies. Specifically, the deviation rule selects the real-time data of a specific operational metric and directly compares it with the corresponding baseline data. If this comparison result meets preset deviation conditions, such as the difference between the real-time value and the baseline value exceeding a preset threshold or percentage, the system determines the current operational state to be abnormal. This mechanism provides a clear and quantifiable basis for the step of "detecting anomalies through preset rules," making the anomaly detection process more objective and accurate. In this way, the system can promptly identify operational metrics that deviate significantly from the baseline state after the update, thus providing accurate input for subsequent anomaly handling and automatic rollback decisions.

[0081] The following are concrete examples. After a new version of a cloud service platform is launched, the system continuously monitors the API response time metric. Before the update, the system had collected and stored baseline data on API response times, for example, an average response time of 50 milliseconds. After the update, the system continuously collects real-time API response times. The deviation rule is configured as follows: if the real-time API response time is 30% or more above the baseline average, it is considered abnormal. This means that if the real-time API response time reaches or exceeds 65 milliseconds (50 milliseconds + 50 milliseconds × 30%), the system immediately determines it to be abnormal. As another example, after an industrial equipment firmware update, the system monitors the motor temperature metric. Before the update, the system recorded the average baseline temperature of the motor during normal operation, for example, 40 degrees Celsius. After the update, the system continuously collects real-time motor temperatures. The deviation rule is configured as follows: if the real-time motor temperature is 5 degrees Celsius or more above the baseline average, it is considered abnormal. This means that if the real-time motor temperature reaches or exceeds 45 degrees Celsius, the system determines it to be abnormal.
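The percentage and absolute deviation conditions from these two examples, together with the standard-deviation form mentioned earlier, can be sketched as below. The function names are hypothetical, and the percentage is passed as a whole number so that the 65 millisecond boundary is compared without floating-point rounding.

```python
from statistics import mean, stdev


def percent_deviation(live, baseline, percent):
    """Relative condition: live is at least `percent` % above the baseline."""
    return live * 100 >= baseline * (100 + percent)


def absolute_deviation(live, baseline, delta):
    """Absolute condition: live is at least `delta` units above the baseline."""
    return live - baseline >= delta


def sigma_deviation(live, baseline_samples, k=2.0):
    """Statistical condition: live is more than k standard deviations above
    the mean of the baseline samples."""
    return live > mean(baseline_samples) + k * stdev(baseline_samples)


# API response time: baseline 50 ms, condition 30 %.
api_abnormal = percent_deviation(65.0, 50.0, 30)

# Motor temperature: baseline 40 degrees Celsius, condition +5 degrees.
motor_abnormal = absolute_deviation(45.0, 40.0, 5.0)

# Two-sigma check against illustrative baseline samples.
sigma_abnormal = sigma_deviation(55, [48, 50, 52])
```

Keeping each condition as its own small function makes the "predetermined deviation condition" a configuration choice per indicator, instead of a single global threshold.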

[0082] Through the above technical solution, this application effectively solves the problem of inaccurate detection caused by fixed thresholds in traditional deviation rules by clearly defining the specific operational mechanism of the deviation rules. Specifically, it compares real-time indicators with a baseline and determines whether predetermined deviation conditions are met. This design makes the threshold for anomaly detection no longer a rigid, fixed value, but a "predetermined deviation condition" that can be flexibly configured according to system characteristics and business needs, thus better adapting to dynamic changes in system operation. This significantly enhances the accuracy and adaptability of anomaly detection, effectively reducing the risk of misjudging normal fluctuations as anomalies (reducing the false alarm rate) and failing to detect real problems in a timely manner (reducing the false negative rate). Therefore, this solution can more accurately and promptly identify potential anomalies after system version updates, providing a reliable basis for subsequent automatic rollback decisions, thereby ensuring the stability and reliability of the system.

[0083] In some of the solutions mentioned above in this application, preset rules are proposed to detect anomalies. However, if the trend rules do not specify how continuous deterioration is to be detected, gradually deteriorating indicators may not be identified in a timely manner, thereby increasing the risk of system failure.

[0084] In response, this application further proposes trend rules including: if a certain indicator continues to deteriorate for multiple consecutive monitoring periods, it is judged as abnormal.

[0085] The trend rule mentioned here is an anomaly detection method. Its core lies in identifying patterns in the changes of system operating indicators over time, especially persistent deterioration trends, rather than simply focusing on the deviation of instantaneous values from the baseline. It aims to capture potential problems that develop slowly but have significant cumulative effects. This rule can be implemented by performing linear regression analysis on indicator data over multiple consecutive monitoring periods. If the slope of the regression line is negative (or positive, depending on the direction of the indicator's deterioration) and its absolute value exceeds a preset threshold, a persistent deterioration trend is considered to exist. Alternatively, statistical methods such as moving averages or exponential smoothing can be used to calculate the short-term trend value of the indicator and compare it with the long-term trend value or the trend value of the previous period. If the short-term trend value is consistently lower than (or higher than) the long-term trend value, or consistently changes in an unfavorable direction, the indicator is determined to be deteriorating. The "indicator" refers to any performance, behavioral, or state parameter that can be quantified, monitored, and analyzed during system operation. For example, it can be a single key performance parameter, such as server CPU utilization or memory usage, or a user experience-related indicator, such as mobile application startup time or page loading speed. The term "consecutive monitoring periods" refers to a series of adjacent data collection or analysis time periods in a time series. A fixed number of periods can be set, for example, requiring the indicator to show a deteriorating trend in the most recent 5 consecutive monitoring periods. 
Alternatively, the required number of consecutive periods can be dynamically adjusted based on the indicator's fluctuation characteristics and the system's tolerance for anomalies. "Continuous deterioration" means that the value of an indicator consistently changes in a direction detrimental to system stability or performance within consecutive monitoring periods. For example, it can be defined as the indicator value in each subsequent monitoring period being worse than the indicator value in the previous monitoring period, and this deterioration continuing for a preset number of consecutive periods. Alternatively, it can be defined as the average or median of the indicator continuously deviating from the baseline or normal range within multiple consecutive monitoring periods, with the degree of deviation gradually increasing. When the above conditions are met, the determination of an anomaly means that the system, based on preset rules and conditions, clearly identifies that the current operating state deviates from the normal baseline or expected behavior, requiring further attention or intervention. The system can immediately generate an anomaly event and record it in the log, simultaneously triggering subsequent decision-making processes or sending a notification to preset alarm recipients.
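As a non-limiting illustration, the linear-regression variant of the trend rule could be sketched in Python as follows. The function names, the `slope_threshold` parameter, and the `higher_is_worse` flag are assumptions introduced for this sketch and do not appear in the application; one aggregated value per monitoring period is assumed.

```python
from statistics import mean


def regression_slope(values):
    """Least-squares slope of the values against period index 0..n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two monitoring periods")
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    numerator = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


def is_deteriorating(values, slope_threshold, higher_is_worse=True):
    """Flag a persistent trend when the slope in the 'bad' direction
    exceeds the preset threshold (hypothetical parameter)."""
    slope = regression_slope(values)
    if not higher_is_worse:
        slope = -slope  # a falling value is the deterioration direction
    return slope > slope_threshold
```

For per-period averages of 100, 120, and 150, the fitted slope is 25 units per period, so any threshold below 25 would flag the indicator.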

[0086] This application's solution introduces a trend rule: "If a certain indicator continuously deteriorates over multiple monitoring periods, it is judged as abnormal." This enables the anomaly detection engine to identify system problems that are not instantaneous and drastic fluctuations, but rather slow and progressive deterioration. After the target system is updated, the indicator acquisition module continuously acquires real-time data for various operational indicators. This real-time data is sent to the anomaly detection engine. When executing the trend rule, the anomaly detection engine continuously tracks the changes of specific indicators within a continuous time window. It does not simply compare the current real-time data with baseline data at a single point, but analyzes the trend of the indicator over multiple consecutive monitoring periods. For example, if an indicator (such as response time) slightly increases in the first monitoring period, continues to increase in the second monitoring period, and increases again in the third monitoring period, and this upward trend continuously meets the preset conditions of "multiple consecutive monitoring periods" and "continuous deterioration," then the anomaly detection engine will determine that the indicator is abnormal. This judgment mechanism can effectively capture hidden problems such as gradually declining system performance, gradually depleting resources, or gradually worsening user experience, avoiding false alarms caused by fluctuations at a single point in time, and preventing serious failures caused by the accumulation of problems. Once an anomaly is detected, the information will be transmitted to the decision-making module, which will then decide whether to execute an automatic rollback based on a preset decision-making strategy, thereby stopping losses in time and ensuring system stability.

[0087] As a specific implementation, suppose a cloud service platform needs to monitor its API response time after releasing a new version. The system continuously collects real-time data on API response time. To apply the trend rule, the system can set a monitoring period of 5 minutes and require that the API response time deteriorate continuously over three consecutive monitoring periods before it is considered abnormal. Specifically, the system records the average API response time within each 5-minute period. If the average response time is 100ms in the first 5-minute period, rises to 120ms in the second 5-minute period, and further rises to 150ms in the third 5-minute period, and all three values are higher than the baseline average response time of 80ms before the update, then the system will determine that the API response time has a continuously deteriorating trend and trigger an anomaly alarm. In this case, "a certain indicator" is API response time, "multiple consecutive monitoring periods" refers to three consecutive 5-minute periods, and "continuous deterioration" means that the average response time of each subsequent period is higher than that of the previous period.
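The check in this example can be sketched as follows. This is a minimal illustration, not the claimed implementation; the function name and parameters are hypothetical, and the input is assumed to be one average value per monitoring period, oldest first.

```python
def continuous_deterioration(period_values, baseline, periods=3,
                             higher_is_worse=True):
    """True when the last `periods` values are all worse than the baseline
    and each one is worse than the one before it."""
    if periods < 2:
        raise ValueError("periods must be at least 2")
    if len(period_values) < periods:
        return False  # not enough monitoring periods observed yet
    recent = list(period_values[-periods:])
    if not higher_is_worse:
        # Flip signs so that "larger" always means "worse" below.
        recent = [-v for v in recent]
        baseline = -baseline
    worse_than_baseline = all(v > baseline for v in recent)
    strictly_worsening = all(b > a for a, b in zip(recent, recent[1:]))
    return worse_than_baseline and strictly_worsening
```

With the figures above, `continuous_deterioration([100, 120, 150], baseline=80)` evaluates to `True`, whereas a sequence such as 100, 120, 110 does not satisfy the rule because the third period improves on the second.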

[0088] By employing the aforementioned technical solution, this method addresses the problem that traditional approaches lack a clearly defined trend rule and therefore fail to identify gradually deteriorating indicators in time. By clearly defining the criterion that "if an indicator continuously deteriorates over multiple consecutive monitoring periods, it is considered abnormal," this method can capture slowly developing system anomalies with significant cumulative effects, such as slow performance decline, gradual resource depletion, or a gradual deterioration in user experience. This not only improves the accuracy and timeliness of anomaly detection, reducing misjudgments caused by single-point fluctuations or transient changes, but also enhances the system's ability to identify progressive failures. This allows for timely intervention before problems escalate into serious malfunctions, effectively reducing system failure risks and ensuring stable operation after system updates.

[0089] In some of the solutions mentioned above in this application, anomaly detection methods are proposed to detect anomalies after system updates through preset rules. However, in the implementation process, fixed thresholds may lead to false alarms or missed alarms, and cannot adapt to the dynamic changes of different systems or scenarios, thereby affecting the accuracy and reliability of detection.

[0090] In response, this application further proposes a threshold adaptive adjustment step: dynamically adjusting the threshold for anomaly detection based on historical operational data.

[0091] This threshold adaptive adjustment step aims to dynamically modify or optimize the judgment thresholds used for anomaly detection based on the actual operating conditions of the system. Its core is to enable the anomaly detection mechanism to learn and adapt, rather than relying on static, preset fixed values. Through this step, the system can better cope with changes in the operating environment, load fluctuations, and the natural evolution of system behavior, thereby improving the accuracy and robustness of anomaly detection. Specifically, the anomaly detection threshold is dynamically adjusted based on historical operating data. Here, "historical operating data" refers to a collection of records of various operating indicators of the system over a past period (e.g., before an update, during stable operation, or under specific loads). This data contains behavioral patterns of the system under normal or specific states. Based on this historical data, the "dynamic adjustment" of the threshold can be achieved in various ways. For example, statistical methods can be used, such as calculating the mean, standard deviation, and percentiles of historical data, and setting dynamic upper and lower limits as the anomaly detection thresholds accordingly; alternatively, machine learning models can be used to learn the boundaries between anomaly and normal patterns by training on historical data, thereby outputting optimized thresholds adapted to the current system state. Another approach is based on a rule engine, triggering preset threshold adjustment strategies based on specific patterns or events observed in historical data.
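As a non-limiting illustration of the statistical approach, thresholds could be derived from historical data as sketched below. The function names and the multiplier `k` are assumptions for this sketch; the percentile variant uses the standard library's inclusive interpolation method, which is one possible choice among several.

```python
from statistics import mean, quantiles, stdev


def mean_std_bounds(history, k=3.0):
    """Lower and upper limits as mean -/+ k sample standard deviations."""
    mu = mean(history)
    sigma = stdev(history)
    return mu - k * sigma, mu + k * sigma


def percentile_threshold(history, pct=99):
    """Upper limit taken as the pct-th percentile of the historical data."""
    if not 1 <= pct <= 99:
        raise ValueError("pct must be between 1 and 99")
    cut_points = quantiles(history, n=100, method="inclusive")
    return cut_points[pct - 1]
```

Recomputing these limits whenever new historical data is appended is what makes the threshold dynamic rather than a fixed preset value.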

[0092] This application's solution addresses the rigidity issue of traditional fixed thresholds in anomaly detection by introducing an adaptive threshold adjustment step, significantly improving the system's adaptability and detection accuracy. After the target system is updated to a new version, the system continuously collects real-time data of various operational indicators and compares it with baseline data collected and stored before the update, detecting anomalies using preset rules. To avoid misjudgments due to environmental or load changes, this solution dynamically adjusts the anomaly detection threshold based on historical operational data accumulated over a long period, either before or in parallel with anomaly detection. Specifically, the system analyzes the distribution, fluctuation range, and trend changes of various indicators in historical operational data to calculate the most suitable anomaly detection threshold. For example, if historical data shows that a certain indicator fluctuates periodically within a specific time period, the system will adjust the anomaly detection threshold for that period accordingly, allowing it to tolerate normal periodic fluctuations while still promptly detecting unexpected anomalies. It is precisely because of this mechanism of dynamically adjusting thresholds based on historical data that the preset rules for anomaly detection can always maintain a high degree of consistency with the actual operating state of the system. This allows for more accurate triggering of subsequent decision-making strategies when an anomaly is detected, determining whether to execute an automatic rollback, and ultimately ensuring the stability and reliability of the system after the update.

[0093] The following example illustrates this. Let's take anomaly detection after the launch of a new version of a cloud service platform as an example. Before the system update, the system collects and stores baseline data for key performance indicators such as API response time and error rate. After the new version is launched, the system continuously collects real-time data for these indicators. To ensure the accuracy of anomaly detection, the system initiates an adaptive threshold adjustment process. Specifically, the system can continuously collect historical API response time data from the past few weeks or even months and use this data to build a dynamic baseline model. For example, a moving average or exponential smoothing method can be used to calculate the dynamic average and standard deviation of API response time, and then the anomaly detection threshold is set to the dynamic average plus or minus two or three times the standard deviation. When historical data shows that API response times are generally higher during daily peak hours, the system automatically increases the anomaly detection threshold for that period to avoid misjudging normal traffic peaks as anomalies; while during off-peak hours, the threshold is lowered to increase sensitivity to subtle anomalies. Furthermore, the system can also identify the inherent error rate level of specific services or modules by analyzing historical error rate data and adjust their anomaly detection threshold accordingly. For example, if a service occasionally experiences a small number of errors in historical data, which is normal, the system will correspondingly relax the error rate threshold for that service to avoid frequent false alarms. 
In this way, when real-time API response time or error rate data is compared with the adaptively adjusted threshold, the system can more accurately determine whether a real anomaly exists, thereby avoiding false alarms or missed alarms caused by fixed thresholds and ensuring that subsequent automatic rollback decisions are more reasonable and effective.
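The peak and off-peak behaviour described in this example could be approximated by keeping a separate threshold per hour of day. The sketch below makes that simplifying assumption (simple hourly bucketing in place of the moving average or exponential smoothing named above); the function names, the `(hour, value)` sample format, and the `fallback` parameter are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def hourly_upper_thresholds(samples, k=3.0):
    """Build {hour_of_day: mean + k * std} from (hour, value) history."""
    buckets = defaultdict(list)
    for hour, value in samples:
        buckets[hour].append(value)
    return {hour: mean(vals) + k * pstdev(vals)
            for hour, vals in buckets.items()}


def is_anomalous(hour, value, thresholds, fallback):
    """Compare a real-time value with the threshold for its hour.
    `fallback` is used for hours with no historical data."""
    return value > thresholds.get(hour, fallback)
```

Under this scheme a response time that is normal during the evening peak can still be flagged if it occurs in the early morning, because each hour is judged against its own history.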

[0094] Through the above technical solution, this application effectively solves the rigidity problem of traditional fixed thresholds in anomaly detection. By dynamically adjusting the anomaly detection threshold based on historical operating data, the system can better adapt to various changes in the operating environment, load fluctuations, and natural evolution of system behavior that may occur after the target system is updated. This significantly improves the accuracy and reliability of anomaly detection, reduces false positives and false negatives, and enables the system to more accurately trigger subsequent decision-making strategies when an anomaly is detected, determining whether to perform automatic rollback. Therefore, this solution not only improves the intelligence level of system self-healing but also reduces the frequency and cost of manual intervention, ensuring stable operation and user experience after system updates.

[0095] In some of the solutions mentioned above in this application, a decision-making strategy is proposed to determine whether to perform automatic rollback when an anomaly is detected. However, in this process, the decision-making strategy may lack flexibility and cannot be dynamically adjusted according to different operation and maintenance needs, risk preferences or the severity of the anomaly, resulting in overly rigid rollback decisions, which may lead to misoperation or response delays, affecting system stability and user experience.

[0096] In response, this application further proposes decision-making strategies including at least one of automatic rollback, semi-automatic rollback, or disabling rollback.

[0097] Specifically, automatic rollback refers to the system automatically triggering a rollback operation without manual intervention when an anomaly is detected. This strategy ensures a rapid response to severe anomalies, minimizing the impact of failures on system stability and user experience. For example, the system can preset a high-priority anomaly threshold; once real-time data meets the threshold condition when compared with baseline data, the process of restoring to the previous version and recovering relevant data from backup is immediately executed. Alternatively, by configuring the policy engine, specific types of anomaly events can be directly associated with automatic rollback commands; when such an event occurs, the system automatically performs a rollback.

[0098] The semi-automatic rollback refers to the system sending a notification to designated personnel after detecting an anomaly, and then executing the rollback operation after manual confirmation. This strategy aims to balance the efficiency of automation with the risk management of manual control, and is suitable for scenarios requiring manual assessment or confirmation. For example, the system can send anomaly alerts to maintenance personnel via email, SMS, or instant messaging tools, and provide a link or button containing a rollback confirmation option. The system only executes the rollback after the maintenance personnel click to confirm; alternatively, the system can display detailed anomaly information on the management interface, waiting for the administrator to log in and manually trigger the rollback operation.

[0099] Disabling rollback means that after detecting an anomaly, the system only records the anomaly information without performing any rollback operations. This strategy is suitable for low-risk scenarios, testing environments, or complex anomalies that require in-depth manual analysis, avoiding unnecessary rollbacks. For example, the system can record the anomaly event in detail in the log system or alarm platform, but without triggering any rollback process; or, after detecting an anomaly, the system can only send a notification to relevant personnel, but the rollback function remains inactive, with subsequent handling decided manually.

[0100] This application's solution addresses the issue of decision-making flexibility by defining the specific components of the decision-making strategy, ensuring that rollback operations can adapt to diverse operational scenarios. The decision-making strategy includes at least one of automatic rollback, semi-automatic rollback, or disabling rollback. This feature allows the system to select different strategies based on its configuration: automatic rollback is triggered directly without manual intervention in severe anomalies, reducing response latency; semi-automatic rollback requires manual confirmation before execution, balancing automation and control risks; disabling rollback only records anomalies without performing a rollback, suitable for low-risk or test environments. By including at least one strategy, this solution provides configurable flexibility, avoiding the limitations of a single strategy, thereby optimizing the decision-making process and improving system adaptability and user-friendliness.

[0101] Building upon the aforementioned methods, this application's solution continuously collects real-time data on various operational metrics after the target system is updated to a new version. This real-time data is compared with baseline data, and pre-defined rules are used to detect anomalies. Once an anomaly is detected, the system makes a judgment based on a pre-defined decision-making strategy. If the strategy is configured for automatic rollback, the system immediately initiates a process to restore to the previous version and recover relevant data from backups, thus achieving rapid self-healing. If the strategy is configured for semi-automatic rollback, the system pauses the rollback operation and sends a notification to pre-defined maintenance personnel, awaiting manual confirmation before executing the rollback. This provides a necessary intervention window for complex or high-risk scenarios. If the strategy is configured to disable rollback, the system only records the abnormal event without performing a rollback, which is particularly useful in test environments or when detailed analysis of the anomaly's cause is required. This flexible decision-making mechanism, combined with continuous anomaly detection, enables the system to take the most appropriate response measures based on the actual situation and risk appetite, significantly improving the system's stability and reliability after version updates.
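The three strategies described above can be sketched as a small dispatch function. This is an illustrative sketch only; the enum, the action names, and the `confirmed_by_operator` parameter are assumptions introduced here, and actual notification and rollback mechanisms are left out.

```python
from enum import Enum


class RollbackStrategy(Enum):
    AUTOMATIC = "automatic"
    SEMI_AUTOMATIC = "semi_automatic"
    DISABLED = "disabled"


def decide(strategy, confirmed_by_operator=False):
    """Return the ordered list of actions to take for a detected anomaly."""
    if not isinstance(strategy, RollbackStrategy):
        raise ValueError(f"unknown strategy: {strategy!r}")
    actions = ["log_anomaly"]  # every strategy records the event
    if strategy is RollbackStrategy.AUTOMATIC:
        actions.append("rollback")
    elif strategy is RollbackStrategy.SEMI_AUTOMATIC:
        actions.append("notify_operator")
        if confirmed_by_operator:
            actions.append("rollback")  # only after manual confirmation
    # DISABLED: nothing beyond recording the anomaly
    return actions
```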

[0102] The following example illustrates this for a cloud service platform after a new version is launched. The platform continuously monitors operational metrics such as API response time, error rate, and user complaint rate. When the error rate exceeds twice the baseline for five consecutive minutes, the system determines it to be an anomaly. At this point, the preset decision-making strategy determines how the anomaly is handled. As a specific implementation, if the cloud service is a core payment system with extremely high stability requirements, the decision-making strategy can be configured for automatic rollback. Once an abnormal error rate is detected, the system immediately triggers a recovery to the previous version without manual intervention, and restores service configuration and user data from the snapshot to ensure business continuity.

[0103] As another specific implementation, if the cloud service is a non-core auxiliary function and the anomaly may require further manual analysis, the decision-making strategy can be configured as a semi-automatic rollback. After the system detects an abnormal error rate, it will send an alarm notification to the operations and maintenance team. After assessing the severity and scope of the anomaly, the operations and maintenance personnel will manually confirm whether to perform a rollback operation.

[0104] As another specific implementation, if the cloud service is in the testing or canary release phase, and the main purpose is to collect abnormal data for analysis, the decision-making strategy can be configured to disable rollback. The system only records error rate anomalies in the log and does not perform rollback operations, so that developers can investigate the problem later and decide whether to intervene manually based on the analysis results.
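The trigger condition shared by the three implementations in paragraphs [0102] to [0104], namely an error rate above twice the baseline for five consecutive minutes, could be evaluated as sketched below. The function name is hypothetical, and one error-rate sample per minute is assumed.

```python
def sustained_breach(per_minute_rates, baseline_rate, factor=2.0, minutes=5):
    """True when each of the last `minutes` samples exceeds
    factor * baseline_rate."""
    if len(per_minute_rates) < minutes:
        return False  # observation window not yet full
    limit = factor * baseline_rate
    return all(rate > limit for rate in per_minute_rates[-minutes:])
```

A single sample at or below the limit inside the window resets the condition, which keeps a brief spike from triggering the decision-making strategy.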

[0105] Through the above technical solution, this application can flexibly select automatic rollback, semi-automatic rollback, or disable rollback based on the severity of the anomaly, the criticality of the system, and different operation and maintenance strategies, avoiding the limitations of a single rollback strategy. This enables the system to respond to anomalies more intelligently and accurately after version updates, effectively reducing the risk of misoperation, reducing the cost of manual intervention, and significantly improving the adaptability and stability of the system in different application scenarios, thereby optimizing the anomaly handling process after system version updates.

[0106] In some of the solutions described above in this application, a rollback step is proposed to automatically restore the system to a stable state. However, in this process, the specific implementation of version recovery and data recovery is unclear, which may lead to poor compatibility or low efficiency of the rollback operation across different system types. For example, the rollback may fail to adapt to the differences between software systems, hardware systems, and firmware systems, or incomplete backup and recovery may introduce a risk of data inconsistency.

[0107] In response, this application further proposes restoring to the previous version by invoking the version control module or firmware rollback mechanism; and restoring from backups by restoring from snapshots, images, or configuration backups.

[0108] Specifically, restoring to a previous version can include invoking a version control module. A version control module is a tool or system used to manage historical version changes of software code, documentation, or other digital assets. For example, for software systems, the application programming interface (API) of distributed or centralized version control systems such as Git or SVN can be called to perform `checkout` or `revert` operations, restoring the code repository or deployment directory to a specific version. Alternatively, it can be a version management function integrated within the system, which replaces the fileset of an application or service with the fileset of the older version through specific instructions or interfaces. Restoring to a previous version can also include invoking a firmware rollback mechanism. A firmware rollback mechanism is a recovery method for embedded software (firmware) in hardware devices, allowing the device to restore to a previously stable firmware version when firmware updates fail or encounter problems. For example, many embedded systems reserve a spare firmware partition; when the main firmware encounters a problem, the bootloader can switch to the spare partition to load the older version of firmware. For example, instructions can be sent to the device through specific hardware interfaces (such as JTAG, SPI) or device management protocols to trigger its internal firmware recovery logic, replacing the firmware image stored in non-volatile memory with an older version.

[0109] Restoring from backups can include restoring from snapshots. A snapshot is a copy of a system or data state at a specific point in time, typically stored incrementally, allowing for rapid recovery to that point. For example, for virtual machines or storage volumes, the snapshot management API provided by a virtualization platform (such as VMware vSphere or KVM) or storage system (such as ZFS or LVM) can be used to roll back the system state to a specific snapshot point. Alternatively, it can be an application-level data snapshot, using the database's transaction logs or specific tools to restore the database to a specific point in time. Restoring from backups can also include restoring from images. An image is an exact copy of a complete system or disk, containing the operating system, applications, and data, and can be used for rapid deployment or disaster recovery. For example, a target system can be redeployed using a pre-built older version system image through interfaces provided by deployment tools (such as PXE boot or cloud-init) or cloud platforms (such as AWS AMIs or Azure VM images). Another example is a complete clone of a physical disk, where an older version of the disk image is written to the target storage device using disk cloning tools. Restoring from backups can also include restoring from configuration backups. Configuration backups are copies of critical configuration parameters for a system or application, used to maintain consistency during system recovery or migration. For example, this can be done by reading older versions of configuration files (such as YAML, JSON, or XML files) stored in a version control system or dedicated configuration management tool and applying them to the target system. Alternatively, it can involve importing pre-exported older configuration data, such as network device configurations and database parameter settings, through system management interfaces or command-line tools.

[0110] This application's solution addresses compatibility and reliability issues during the recovery process by explaining the specific implementation of the rollback operation, ensuring efficient rollback execution in various system environments. When an anomaly is detected after a system version update and an automatic rollback is initiated, this solution can flexibly select the appropriate version recovery mechanism based on the target system type (e.g., software system, hardware system, or firmware system). For software systems, the version control module can be invoked to restore the code version; for hardware or firmware systems, the firmware rollback mechanism can be triggered to restore the firmware version. This flexibility avoids rollback failures or operational complexity caused by differences in system architecture. Furthermore, regarding data recovery, this solution provides diverse backup and recovery options, including recovery from snapshots, images, or configuration backups. Snapshots can be used for rapid system state restoration, images for complete data restoration, and configuration backups to ensure parameter consistency. These diverse recovery methods ensure the integrity and consistency of data recovery, reduce the risk of data loss, and adapt to different backup strategy requirements. Through these specific recovery methods, this solution makes the key step of "restoring to the version before the update and recovering relevant data from the backup" in the above-mentioned anomaly detection and automatic rollback method concrete, reliable and efficient, thereby enhancing the practicality and robustness of the entire method.
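As a non-limiting illustration, the selection of recovery mechanisms by target system type could be expressed as a plan builder. All identifiers below (the mapping, the step labels, and the function name) are assumptions made for this sketch; the plan only names the steps and does not execute them.

```python
# Version recovery mechanism per target system type, following the text:
# version control for software, firmware rollback for hardware/firmware.
VERSION_MECHANISMS = {
    "software": ["version_control"],
    "hardware": ["firmware_rollback"],
    "firmware": ["firmware_rollback"],
    "combined": ["version_control", "firmware_rollback"],
}
BACKUP_SOURCES = ("snapshot", "image", "config_backup")


def build_rollback_plan(system_type, backup_sources):
    """Return ordered step labels: version recovery first, then data recovery."""
    if system_type not in VERSION_MECHANISMS:
        raise ValueError(f"unsupported system type: {system_type!r}")
    unknown = [s for s in backup_sources if s not in BACKUP_SOURCES]
    if unknown:
        raise ValueError(f"unsupported backup sources: {unknown}")
    steps = [f"restore_version:{m}" for m in VERSION_MECHANISMS[system_type]]
    steps += [f"restore_data:{s}" for s in backup_sources]
    return steps
```

Keeping the plan as data makes it possible to log or review the intended steps before a rollback execution module carries them out.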

[0111] The following is a concrete example. In a cloud service platform, after a new version is launched, the system continuously monitors metrics such as API response time and error rate. Once the error rate exceeds twice the baseline for five consecutive minutes, the system triggers an automatic rollback. At this time, the system invokes the version control module, for example, by executing Git commands to roll back the deployed service code to the previous stable version. Simultaneously, to restore relevant data, the system restores the database state from a pre-created database snapshot and loads the old version of the service configuration file from the configuration backup, ensuring the consistency of the service environment.
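One possible way to perform the Git step in this example is sketched below. The application does not specify which Git commands are used; this sketch assumes the deployment directory is a Git working tree and that the previous stable version is identified by a tag or commit. The directory path and reference name shown in the usage note are hypothetical.

```python
import subprocess


def git_rollback_command(deploy_dir, stable_ref):
    """Build the argv for checking out a known-good ref in deploy_dir."""
    return ["git", "-C", deploy_dir, "checkout", "--detach", stable_ref]


def run_git_rollback(deploy_dir, stable_ref):
    """Execute the checkout; raises CalledProcessError if Git fails."""
    subprocess.run(git_rollback_command(deploy_dir, stable_ref), check=True)
```

For instance, `git_rollback_command("/srv/app", "v1.4.2")` yields the argument list for `git -C /srv/app checkout --detach v1.4.2`. Database snapshot and configuration restoration would be separate steps that depend on the storage platform in use.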

[0112] Through the above technical solution, this application can provide a rollback mechanism with strong compatibility and high reliability, effectively solving the problem of unclear specific implementation methods for version recovery and data recovery, ensuring that rollback operations can be performed efficiently and completely under different system types, thereby improving the overall stability and reliability of the anomaly detection and automatic rollback method after system version update.

[0113] In some of the solutions mentioned above in this application, a target system is proposed to perform anomaly detection and automatic rollback after a version update. However, since the target system type is not specifically limited, the applicability of the method to different system types may be limited, and the indicator collection, anomaly detection, and rollback execution processes cannot be optimized for the specific characteristics of software, hardware, firmware, or combined software-hardware systems.

[0114] The technical solution of this application can be applied to various systems that require anomaly detection and automatic rollback after version updates. As an example, the target system can be a software system, a hardware system, a firmware system, or a combination of software and hardware. Those skilled in the art should understand that these are merely illustrative examples, and the scope of protection of this invention is not limited thereto.

[0115] In practical applications, monitoring metrics and rollback methods can be adapted according to the type of the target system. For example, when the target system is a software system, the focus can be on monitoring software-level metrics such as code execution efficiency, memory usage, API response time, and error logs, and a version control system can be used to roll back the code.

[0116] When the target system is a hardware system, the focus can be on monitoring physical sensor data, such as temperature, voltage, current, and vibration frequency, and when an anomaly is detected, appropriate hardware recovery methods can be used.

[0117] For systems that combine software and hardware, metrics data from both software and hardware can be integrated for comprehensive analysis, enabling more complete anomaly detection and coordinated rollback.

[0118] It should be noted that the above adaptation methods are merely examples. Those skilled in the art can select appropriate monitoring indicators and rollback strategies based on the actual system characteristics, without departing from the protection scope of this invention.

[0119] As a specific implementation method, an industrial robot control system can be considered. As an example, this control system can be viewed as a hardware and software integrated system, whose version updates may involve upgrades to the control software and underlying firmware. After a system update, this method continuously collects multi-dimensional operational metrics. For example, for the software component, it can monitor the execution cycle of the motion control algorithm, task scheduling delays, error logs, etc.; for the hardware component, it collects the temperature, current, vibration amplitude, and actual deviation of the motion trajectory of the joint motors through sensors. When an anomaly is detected, for example, a sustained increase in joint motor temperature accompanied by a motion trajectory deviation exceeding a preset range, the system determines it to be abnormal. At this time, the decision module triggers the rollback execution module according to a preset strategy (e.g., automatic rollback). The rollback execution module first calls the version control module to restore the control software to the version before the update. Simultaneously, if the firmware also has anomalies, it triggers the firmware rollback mechanism to restore the underlying firmware to a stable version. Furthermore, the system also restores relevant calibration parameters and operating mode data from a pre-created configuration backup to ensure the system returns to its stable operating state before the update.
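The combined condition in this example, a sustained rise in joint motor temperature accompanied by a trajectory deviation beyond a preset range, could be checked as sketched below. The function name, the choice of three periods, and the use of only the latest deviation sample are assumptions for illustration.

```python
def robot_anomaly(temperatures, deviations, max_deviation, periods=3):
    """True when temperature rose in each of the last `periods` samples
    and the latest trajectory deviation exceeds the preset range."""
    if len(temperatures) < periods or not deviations:
        return False
    recent = temperatures[-periods:]
    temperature_rising = all(b > a for a, b in zip(recent, recent[1:]))
    deviation_exceeded = abs(deviations[-1]) > max_deviation
    return temperature_rising and deviation_exceeded
```

Requiring both conditions combines a hardware-side trend with a motion-accuracy limit, so a warm motor that still tracks its trajectory within range is not flagged by this rule.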

[0120] The method presented in this application can provide customized anomaly detection and automatic rollback strategies for different types of systems. This solves the problem of limited applicability of traditional methods due to their lack of specificity when dealing with diverse systems. This detailed classification and adaptation makes indicator collection more accurate, anomaly detection more effective, and rollback operations more reliable, thereby significantly improving the stability and security of the system after version updates and reducing the possibility of risks introduced by version updates.

[0121] Traditional system version update methods lack real-time monitoring of the system's operating status after an update, so problems such as performance degradation or reduced user satisfaction cannot be detected in a timely manner. Without an anomaly detection mechanism, problem discovery often relies on user-initiated reports, and rollback operations depend heavily on manual intervention, resulting in delayed responses. In addition, rigid threshold settings cannot be adjusted to the dynamic changes of different systems and application scenarios, leading to false alarms or missed alarms. Such systems also lack self-healing capabilities, so failures may continue to affect user experience or even cause data loss or security incidents.

[0122] To address this, this application proposes an anomaly detection and automatic rollback system after a system version update, comprising: an indicator acquisition module for collecting baseline and real-time data of operating indicators before and after the target system update; a monitoring module for continuously monitoring real-time data; an anomaly detection engine for detecting anomalies according to preset rules; a decision module for deciding whether to perform automatic rollback according to preset decision strategies; and a rollback execution module for restoring to the version before the update and restoring relevant data from backups.
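
One way to picture how the modules listed above cooperate is the following sketch, in which each module is an injected callable. The class name `RollbackSystem` and all method and field names are assumptions made for illustration; the application does not prescribe this structure.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass
class RollbackSystem:
    """Wires the modules together; each module is an injected callable."""

    collect: Callable[[], Metrics]               # indicator acquisition module
    detect: Callable[[Metrics, Metrics], bool]   # anomaly detection engine
    decide: Callable[[bool], bool]               # decision module
    rollback: Callable[[], None]                 # rollback execution module
    baseline: Metrics = field(default_factory=dict)
    history: List[Metrics] = field(default_factory=list)

    def capture_baseline(self) -> None:
        """Collect and store baseline data before the update."""
        self.baseline = self.collect()

    def monitor_once(self) -> bool:
        """One monitoring cycle; returns True if a rollback was executed."""
        current = self.collect()
        self.history.append(current)
        anomalous = self.detect(self.baseline, current)
        if self.decide(anomalous):
            self.rollback()
            return True
        return False
```

In this sketch the monitoring module corresponds to calling `monitor_once` repeatedly, for example from a scheduler; injecting the callables keeps each module replaceable for a different system type.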

[0123] This application utilizes an indicator acquisition module to collect baseline and real-time data of operational indicators before and after the target system update, providing fundamental data support for the entire monitoring process and ensuring that subsequent analysis is based on a comparison before and after the update. The monitoring module continuously monitors real-time data, enabling real-time tracking of the system status and avoiding the risks associated with monitoring lag. The anomaly detection engine detects anomalies based on preset rules, rather than relying on fixed thresholds. It achieves intelligent identification through rules such as deviation or trend analysis, solving the problem of rigid thresholds. The decision-making module determines whether to execute automatic rollback based on preset decision strategies, providing flexible strategies such as automatic or semi-automatic modes, reducing the need for manual intervention. The rollback execution module restores the system to the version before the update and recovers relevant data from backups, achieving rapid system self-healing and overcoming the deficiency of lacking automatic recovery capabilities.

[0124] Through the above technical solutions, the system can: safeguard stability in real time, minimizing the duration of failure impact; reduce the cost of manual intervention by eliminating the need for continuous manual monitoring and operation; detect anomalies accurately, reducing false alarms and missed alarms by combining deviation rules, trend rules, and combination rules; adjust thresholds dynamically according to different systems and application scenarios, avoiding a one-size-fits-all approach; ensure that relevant personnel are notified at critical nodes and remain aware of the system status; and provide configurable policies, allowing users to choose automatic, semi-automatic, or disabled rollback according to their needs. The solution can be applied to scenarios such as industrial equipment, robots, automotive systems, smart homes, and pure software services.
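
The configurable rollback policies (automatic, semi-automatic, disabled) can be illustrated with a small decision function. This is a sketch only; the enum names, the `confirmed` flag, and the returned actions are assumptions introduced for the example.

```python
from enum import Enum


class RollbackPolicy(Enum):
    AUTOMATIC = "automatic"
    SEMI_AUTOMATIC = "semi_automatic"
    DISABLED = "disabled"


class Action(Enum):
    NONE = "none"
    ROLLBACK = "rollback"
    AWAIT_CONFIRMATION = "await_confirmation"
    RECORD_ONLY = "record_only"


def decide(policy: RollbackPolicy, anomaly_detected: bool, confirmed: bool = False) -> Action:
    """Map a detected anomaly to an action under the configured policy."""
    if not anomaly_detected:
        return Action.NONE
    if policy is RollbackPolicy.AUTOMATIC:
        return Action.ROLLBACK
    if policy is RollbackPolicy.SEMI_AUTOMATIC:
        # Roll back only after a human has confirmed the request.
        return Action.ROLLBACK if confirmed else Action.AWAIT_CONFIRMATION
    return Action.RECORD_ONLY  # rollback disabled: the anomaly is only logged
```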

[0125] In some of the solutions mentioned above in this application, an anomaly detection engine is proposed to detect system anomalies. However, in this process, since the anomaly detection threshold is fixed and cannot be adaptively adjusted according to the historical operating data of the system, the threshold setting becomes rigid, which can easily lead to false alarms or missed alarms, affecting the accuracy of anomaly detection and system stability.

[0126] In this regard, this application further proposes that the above system also includes a threshold adaptive module, which is used to dynamically adjust the threshold of anomaly detection based on historical operating data.

[0127] The threshold adaptive module is a functional unit whose main responsibility is to intelligently adjust the judgment thresholds used in the anomaly detection process based on the actual system operation and historical data. It is not a simple fixed-value storage device, but a component with analytical and decision-making capabilities. This module can be an independent software service or process that interacts with the anomaly detection engine through an application programming interface (API), receiving historical operational data and outputting the adjusted thresholds. Alternatively, this module can be a sub-functional unit within the anomaly detection engine, integrated into the anomaly detection logic, responsible for updating internal threshold parameters before each detection or periodically. Furthermore, this module can also be a component based on a machine learning model, predicting the optimal threshold by training on historical data.

[0128] The technical feature of "dynamically adjusting anomaly detection thresholds based on historical operational data" aims to solve the problems of false alarms and false negatives caused by fixed thresholds. By utilizing data accumulated over long-term system operation, the judgment criteria for anomaly detection are made more closely aligned with the actual behavior patterns of the system. Historical operational data can include statistical information such as the average, standard deviation, maximum, minimum, and trend of various operational indicators over a past period (e.g., the most recent 24 hours, 7 days, or a longer period). Dynamic adjustment can be based on statistical methods, such as setting the threshold as the historical average plus or minus a certain standard deviation multiple, or determining it based on the quantiles of historical data. Dynamic adjustment can also be achieved through machine learning algorithms, such as using regression models to predict the normal range of indicators over a future period, or using anomaly detection algorithms (such as Isolation Forest, One-Class SVM) to learn normal behavior patterns from historical data and set dynamic anomaly boundaries accordingly. Dynamic adjustment can also combine business rules and expert experience; for example, during specific business peak periods, the thresholds of certain indicators can be appropriately relaxed, while they can be tightened during system maintenance periods.
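
The two statistical options mentioned above (historical average plus a multiple of the standard deviation, and a quantile of the historical data) can be sketched with the Python standard library. The function names and the defaults `k = 3.0` and `q = 0.99` are illustrative assumptions; the quantile uses simple linear interpolation between ranks, which is one of several common conventions.

```python
import statistics
from typing import Sequence


def sigma_threshold(history: Sequence[float], k: float = 3.0) -> float:
    """Upper threshold = historical mean + k * population standard deviation."""
    return statistics.fmean(history) + k * statistics.pstdev(history)


def quantile_threshold(history: Sequence[float], q: float = 0.99) -> float:
    """Upper threshold = q-th quantile, linearly interpolated between ranks."""
    if not history:
        raise ValueError("history must not be empty")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    ordered = sorted(history)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])
```

A quantile-based threshold makes no assumption about the shape of the distribution, which can matter for skewed metrics such as response time; the mean-plus-sigma form is simpler but is more sensitive to outliers in the history.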

[0129] In the anomaly detection and automatic rollback system after the system version update, the indicator acquisition module continuously collects the target system's operational indicators, the monitoring module continuously monitors this real-time data, and the anomaly detection engine makes anomaly judgments based on preset rules. However, if the thresholds used by the anomaly detection engine are fixed, these fixed thresholds may no longer be applicable when the system's operating environment, load, or behavior patterns change, leading to false alarms or missed alarms. To address this issue, this application introduces a threshold adaptive module. This module works closely with the anomaly detection engine, and its core working principle is that it no longer relies on statically preset thresholds but actively and continuously analyzes the system's long-term accumulated historical operational data. This historical data reflects the system's normal behavior patterns under different times and loads. By performing statistical analysis or pattern learning on this historical data, the threshold adaptive module can identify the normal fluctuation range and trend of the system's operational indicators. Based on this deep understanding of the system's "normal" state, this module can dynamically calculate and update the thresholds used by the anomaly detection engine. For example, when system load generally increases, the threshold adaptive module can correspondingly raise the abnormal thresholds for certain performance indicators to avoid misjudging normal performance fluctuations as anomalies. Conversely, when the system is under low load or during periods with higher stability requirements, the thresholds can be tightened to increase anomaly sensitivity. In this way, the threshold adaptive module enables the anomaly detection engine's judgment criteria to adapt to the actual operating conditions of the target system in real time and intelligently. 
It ensures the accuracy of anomaly detection and reduces the risk of invalid alarms or missed real problems due to inappropriate thresholds. This dynamic adjustment mechanism allows the entire anomaly detection and automatic rollback system to maintain its robustness and effectiveness in the face of complex and ever-changing operating environments, thereby more reliably ensuring the stability of the system after updates.

[0130] As a specific implementation, the threshold adaptive module can be configured as a standalone microservice that periodically retrieves from the data store the historical data of the target system's operational metrics over the past 7 days. For example, for the CPU utilization metric, this module can calculate the average and standard deviation of hourly CPU utilization over the past 7 days and then set the anomaly detection threshold to "average + 3 times the standard deviation". The anomaly detection engine then flags an anomaly only when real-time CPU utilization exceeds this dynamically calculated threshold. For instance, suppose a target system's CPU utilization typically fluctuates between 40% and 60% on weekdays and between 10% and 20% on weekends, and a traditional fixed threshold is set to 70%. On weekends, CPU utilization reaching 30% is well outside the normal range, yet it stays below the fixed threshold and goes undetected. With the threshold adaptive module, the threshold could be dynamically adjusted to 75% on weekdays (e.g., 60% + 3 * 5%) and to about 25% on weekends (e.g., 20% + 3 * 1.6% = 24.8%). In this way, the anomaly detection engine can more accurately identify genuine anomalies; for example, a sudden spike in CPU utilization to 40% over the weekend can be detected promptly. Furthermore, the threshold adaptive module can employ machine learning models, such as a prediction model based on a Long Short-Term Memory (LSTM) network. This model can continuously learn from historical time-series data of various metrics to predict the normal range of the metrics over a future period. When real-time data falls outside the predicted range, it is considered an anomaly. This approach can capture more complex time-series patterns and seasonal variations, further improving the intelligence of threshold adjustment.
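
The weekday and weekend thresholds in this example imply that the history is split by day type before the statistics are computed. A minimal sketch of that bucketing is shown below; the function names are hypothetical, and the two-bucket split (Monday to Friday versus Saturday and Sunday) is an assumption, since the text does not state how the history is segmented.

```python
import statistics
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List, Tuple


def bucket_of(ts: datetime) -> str:
    """Monday-Friday -> 'weekday'; Saturday/Sunday -> 'weekend'."""
    return "weekend" if ts.weekday() >= 5 else "weekday"


def bucketed_thresholds(
    samples: Iterable[Tuple[datetime, float]], k: float = 3.0
) -> Dict[str, float]:
    """Compute mean + k * std of CPU utilization separately per bucket."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for ts, cpu_percent in samples:
        groups[bucket_of(ts)].append(cpu_percent)
    return {
        name: statistics.fmean(values) + k * statistics.pstdev(values)
        for name, values in groups.items()
    }


def is_anomalous(ts: datetime, cpu_percent: float, thresholds: Dict[str, float]) -> bool:
    """Compare a real-time reading with the threshold of its own bucket."""
    return cpu_percent > thresholds[bucket_of(ts)]
```

With weekday samples whose mean is 60% and standard deviation 5%, and weekend samples whose mean is 20% and standard deviation 1.6%, this yields thresholds of 75% and 24.8%, so a 40% reading is flagged on a weekend but not on a weekday.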

[0131] By introducing a threshold adaptive module that dynamically adjusts the anomaly detection threshold based on historical operational data, this application effectively solves the rigidity problem of traditional fixed thresholds in anomaly detection. After a system version update, the target system's operating environment, load mode, or internal behavior may change, rendering the original fixed thresholds inapplicable and prone to false positives or false negatives. The threshold adaptive module, through continuous learning and analysis of historical system operational data, can intelligently identify dynamic changes in the system's normal behavior patterns and adjust the anomaly detection criteria in real time accordingly. This makes the anomaly detection engine's judgment more accurate, significantly reducing the probability of misjudging normal fluctuations as anomalies, while also improving sensitivity to genuine anomalies and avoiding false negatives. Therefore, the entire anomaly detection and automatic rollback system can more accurately and reliably identify potential problems after system updates, thereby triggering automatic rollback in a timely manner when necessary, maximizing system stability and user experience, and reducing the cost of manual intervention and troubleshooting.

[0132] In some of the solutions mentioned above in this application, an anomaly detection and automatic rollback system is proposed for real-time monitoring and automatic handling of anomalies. However, in this process, users or operation and maintenance personnel may not learn in a timely manner that an anomaly has been detected or a rollback decision has been made, leaving them uninformed and unable to intervene manually or perform analysis when needed.

[0133] In this regard, this application further proposes that the above system also includes a user notification module, which is used to send notifications to users when anomaly detection or rollback decisions are made.

[0134] The user notification module is an integral part of the system, primarily responsible for generating, formatting, and sending various notification messages to designated recipients. This module can be a standalone software component or service, using external communication interfaces such as message queues, email services, SMS gateways, or instant messaging APIs to send notifications; alternatively, it can be a functional unit integrated into the system's core logic, directly interacting with the anomaly detection engine and decision-making module, and internally maintaining notification templates and recipient lists. Its role is to ensure that critical system events (such as anomaly occurrences or rollback decisions) are communicated to relevant personnel in a timely and accurate manner, thereby guaranteeing information transparency and the possibility of human intervention. The user notification module's function is to send notifications to users during anomaly detection or rollback decisions, clearly defining its triggering conditions and core functionality. The user notification module is activated when the anomaly detection engine determines an anomaly or when the decision-making module makes a rollback decision (whether automatic, semi-automatic, or with rollback disabled). This function can be implemented through an event-driven mechanism: after completing their respective judgments, the anomaly detection engine and decision-making module publish corresponding events, and the user notification module subscribes to these events and executes the notification sending operation. The notification content can include the anomaly type, severity, scope of impact, rollback decision results, and recommended actions. "User" can refer to system administrators, operations personnel, developers, product managers, or even end users (in some scenarios). 
Notification methods can include email, SMS, instant messaging, internal system alert interfaces, API callbacks, and more.
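
The event-driven mechanism described in this paragraph (the engine and the decision module publish events, the notification module subscribes to them) can be sketched with a minimal in-process publish/subscribe bus. The class names, the topic strings `anomaly.detected` and `rollback.decided`, and the message format are assumptions for illustration; a real deployment would more likely use a message queue and an email, SMS, or instant messaging gateway behind `send`.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, DefaultDict, List


@dataclass(frozen=True)
class Event:
    topic: str      # e.g. "anomaly.detected" or "rollback.decided"
    severity: str
    detail: str


class EventBus:
    """Minimal in-process publish/subscribe mechanism."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[Event], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[Event], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, event: Event) -> None:
        for handler in self._subscribers[event.topic]:
            handler(event)


class NotificationModule:
    """Formats events into messages; `send` stands in for the delivery channel."""

    def __init__(self, bus: EventBus, send: Callable[[str], None]) -> None:
        self._send = send
        bus.subscribe("anomaly.detected", self._on_event)
        bus.subscribe("rollback.decided", self._on_event)

    def _on_event(self, event: Event) -> None:
        self._send(f"[{event.severity.upper()}] {event.topic}: {event.detail}")
```

Because publishers only know the bus, the anomaly detection engine and the decision module do not need to be changed when a new notification channel is added.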

[0135] This application's solution introduces a user notification module, enabling the entire anomaly detection and automatic rollback system to promptly inform relevant personnel of critical state changes during operation. Specifically, as the indicator acquisition module and monitoring module continuously acquire updated operational data from the target system, the anomaly detection engine analyzes this real-time data according to preset rules and compares it with baseline data to detect any anomalies. Once the anomaly detection engine detects an anomaly, it immediately triggers the user notification module, sending anomaly notifications to preset users. Subsequently, the decision-making module, based on preset decision strategies and the severity of the anomaly, decides whether to perform an automatic rollback. Regardless of the decision outcome (e.g., automatic rollback, semi-automatic rollback requiring manual confirmation, or simply recording the anomaly without rollback), the decision-making module will again trigger the user notification module, sending the corresponding decision notification to the user. This mechanism ensures that relevant personnel receive timely information when critical system states change, thereby making the system's internal state (anomaly occurrence, rollback decision) transparent and solving the problem of users or operations personnel being "unaware." This transparency not only enhances the system's reliability but also provides a window for human intervention, especially under a semi-automatic rollback strategy, where user confirmation is a crucial step in the rollback execution. Even with automatic rollback, notifications allow users to understand the system's self-healing process, facilitating subsequent analysis and optimization. Therefore, the user notification module works closely with the anomaly detection engine and decision-making module to jointly build a more complete, controllable, and user-friendly system.

[0136] The following is a concrete example. Taking the automatic rollback after a new version release of a cloud service platform as an example, after the new version of the cloud service platform goes live, the metric collection module and monitoring module continuously monitor metrics such as API response time, error rate, and user complaint rate. If the anomaly detection engine finds that the API error rate exceeds twice the baseline and persists for a period of time, it is judged as an anomaly. At this time, the user notification module will immediately send an SMS notification to the pre-set mobile phones of the operations team and product manager, with the content: "[Cloud Service Alert] After the new version is released, the API error rate has increased abnormally, triggering an alert. Please pay attention." At the same time, a detailed anomaly report is sent to the operations team's email address. Subsequently, the decision-making module decides to execute the automatic rollback according to the pre-set automatic rollback strategy. The user notification module is triggered again, sending SMS and email notifications to the operations team and product manager: "[Cloud Service Notification] The system has started automatic rollback to the previous stable version, which is expected to be completed in a short time. Please pay attention to the service recovery status." After the rollback execution module completes the recovery, the user notification module can also send a final notification: "[Cloud Service Notification] The system has successfully rolled back to the previous version, and the service has returned to normal." The above technical solutions ensure that relevant personnel (such as operations and maintenance personnel and system administrators) can understand the changes in the system's operational status after version updates in a timely and accurate manner, solving the problem of information asymmetry and improving system transparency. 
At critical moments when the system detects anomalies or makes rollback decisions, users can be notified promptly, allowing them the opportunity for manual intervention, such as confirmation in semi-automatic rollback scenarios or root cause analysis and follow-up processing after automatic rollback. This avoids the escalation of problems or delays in handling due to information lag, thereby improving the system's controllability and reliability. Integration with the underlying system makes the entire anomaly detection and automatic rollback process more complete and user-friendly, enhancing the system's usability and user experience.
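
The detection rule used in the cloud service example (the API error rate exceeds twice the baseline and persists for a period of time) can be sketched as follows. The function name and the 300-second persistence window are assumptions, since the example does not specify how long the breach must last.

```python
from typing import Iterable, Tuple


def sustained_breach(
    samples: Iterable[Tuple[float, float]],
    baseline: float,
    factor: float = 2.0,
    min_duration_s: float = 300.0,  # hypothetical persistence window
) -> bool:
    """True once the metric stays above factor * baseline for min_duration_s.

    `samples` is a time-ordered iterable of (timestamp_seconds, value).
    """
    limit = factor * baseline
    breach_start = None
    for timestamp, value in samples:
        if value > limit:
            if breach_start is None:
                breach_start = timestamp
            if timestamp - breach_start >= min_duration_s:
                return True
        else:
            breach_start = None  # the breach was interrupted; start over
    return False
```

Requiring persistence filters out short spikes, at the cost of delaying detection by at least the length of the window.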

[0137] In some of the embodiments described above in this application, a rollback execution module is proposed to automatically restore to the version before the update and restore relevant data from the backup. However, in its implementation, the rollback operation may lack a specific execution mechanism, resulting in low efficiency, incomplete data recovery, or inconsistent system state.

[0138] In response, this application further proposes configuring the rollback execution module to call the version control module to restore the version before the update and restore the relevant data from the snapshot.

[0139] The rollback execution module is the core component of the system for performing rollback operations. Its role is to coordinate and execute all necessary steps to restore the target system to its stable state before the update upon receiving a rollback command. This module can be a standalone software service, a functional unit integrated into the system management platform, or, in an embedded system, a firmware-level rollback manager. Its design aims to ensure the automation, reliability, and integrity of the rollback process. Calling the version control module to restore to the previous version means that the rollback execution module interacts with a dedicated version control system or mechanism to restore the target system's code, configuration, firmware, or any versioned components to a specific version before the update. One implementation is that the rollback execution module can integrate API calls to a distributed version control system (such as Git) to restore application code or configuration files by executing specific rollback commands (such as `git revert` or `git reset --hard` to a specific commit). Another implementation is that, for hardware or firmware systems, the rollback execution module can trigger the device's built-in firmware rollback mechanism, for example, by switching to a pre-stored older firmware partition (such as an A / B partitioning scheme) or by loading an older firmware image through specific hardware instructions. Another implementation approach is that, for operating systems or software packages, the rollback execution module can call the system's built-in package manager or system restore tool to restore system software packages or critical system files to their pre-update state. Recovering relevant data from snapshots refers to the rollback execution module using system or storage-level snapshot technology to restore data related to the target system's running state to its pre-update state. 
Snapshots are data copies at a specific point in time, providing data consistency guarantees. One implementation approach is that the rollback execution module can integrate with the APIs of virtualization platforms (such as VMware vSphere, KVM) or cloud service providers to trigger snapshot recovery operations on virtual machines or cloud disks, thereby restoring the entire virtual environment or data volume. Another implementation approach is that, for database systems, the rollback execution module can call the snapshot recovery function provided by the database management system (DBMS) or recover data from pre-created logical or physical snapshots of the database. Yet another implementation approach is that, for file systems, the rollback execution module can utilize file system-level snapshot functionality (such as ZFS snapshots, LVM snapshots) to restore the file system to its pre-update state, ensuring the consistency of configuration, logs, or user data.
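
For the Git-based option, the two commands named above differ in whether history is preserved. The sketch below only builds the command line and does not run it; the function name and the `preserve_history` flag are hypothetical, while `git revert --no-edit <commit>..HEAD` and `git reset --hard <commit>` are standard Git invocations.

```python
from typing import List


def git_rollback_command(target_commit: str, preserve_history: bool = True) -> List[str]:
    """Build (but do not run) the git command for a rollback to target_commit.

    preserve_history=True  -> revert every commit made after target_commit,
                              recording new commits that undo them.
    preserve_history=False -> move the branch back to target_commit and
                              discard later changes from the working tree.
    """
    if not target_commit or target_commit.startswith("-"):
        raise ValueError("target_commit must be a commit id or ref name")
    if preserve_history:
        return ["git", "revert", "--no-edit", f"{target_commit}..HEAD"]
    return ["git", "reset", "--hard", target_commit]
```

A caller could pass the returned list to `subprocess.run(cmd, cwd=repo_dir, check=True)`. Ranges that contain merge commits need the additional `-m` option of `git revert`, which this sketch does not handle.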

[0140] This application's solution achieves efficient, complete, and consistent automatic rollback in case of system version updates under abnormal circumstances by configuring the rollback execution module to call the version control module and restore relevant data from snapshots. Specifically, when the anomaly detection engine identifies a system anomaly and the decision module decides to execute an automatic rollback, the rollback execution module first activates the version control module to precisely roll back the target system's program code, configuration files, or firmware version to a stable state before the update. This process leverages the version control system's precise management capabilities of historical versions, ensuring the correct restoration of the system's core logic. Next, the rollback execution module triggers the restoration of relevant data from a pre-created snapshot. Snapshot technology captures the data state at a specific point in time, including the state of the database, file system, or the entire virtual machine, thereby ensuring data integrity and consistency matching the pre-update version. Through this synergy, the system can not only restore to the old program version but also ensure a matching data state, avoiding secondary failures caused by program-data mismatch and greatly improving the reliability and success rate of the rollback operation. This mechanism makes the entire automatic rollback process highly automated and accurate, effectively solving problems such as inefficiency, incomplete data recovery, or inconsistent system states that may occur during rollback operations, thereby ensuring the stable operation of the system after the update.

[0141] The following is a concrete example. Suppose that after a cloud service platform releases a new version, its anomaly detection engine detects a significant increase in API response time and a continuously rising error rate. The decision-making module determines that an immediate automatic rollback is necessary. At this point, the rollback execution module will be activated. First, it sends an instruction to the platform's internal version control system (e.g., an enterprise-level code repository based on Git), requesting that the codebase of the affected microservice instances be rolled back to a specific commit point of the previous stable version. Upon receiving the instruction, the version control system automatically deploys the old version of the code to the corresponding server or container. Simultaneously, the rollback execution module sends a request to the cloud service provider's storage management service, triggering a snapshot recovery operation on the database volume or persistent storage associated with the microservice. For example, if the database is running on a virtual machine, the snapshot of that virtual machine is restored; if a cloud database service is used, the database snapshot created before the update is restored. Through the coordinated execution of these two steps, the system not only restores the old version of the application logic but also restores the data state compatible with that logic, thereby ensuring that the service can quickly and stably recover to its normal operating state before the update.
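
The coordination of the two steps in this example (restoring the code version, then restoring the data snapshot) can be sketched as a small orchestrator. The names `RollbackResult` and `execute_rollback` are hypothetical, and running the steps strictly in sequence and stopping at the first failure is an assumption of this sketch; the example above describes the steps as coordinated without fixing such details.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class RollbackResult:
    succeeded: bool
    completed_steps: List[str]
    error: str = ""


def execute_rollback(steps: List[Tuple[str, Callable[[], None]]]) -> RollbackResult:
    """Run the rollback steps in order and stop at the first failure."""
    completed: List[str] = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:  # report which step failed instead of hiding it
            return RollbackResult(False, completed, f"{name}: {exc}")
        completed.append(name)
    return RollbackResult(True, completed)
```

Returning the list of completed steps lets the caller report a partial rollback, for example code restored but snapshot recovery failed, rather than treating the operation as all or nothing.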

[0142] Through the above technical solution, this application effectively addresses the problems of lack of specific execution mechanisms, low efficiency, incomplete data recovery, or inconsistent system states in rollback operations. The rollback execution module, by invoking the version control module, can accurately and efficiently restore system programs and configurations to the stable version before the update, avoiding the complexity and potential errors of manual operations. Simultaneously, restoring relevant data from snapshots ensures a high degree of matching and consistency between data and program versions, preventing system malfunctions caused by incomplete or mismatched data. This combination makes the automatic rollback process more reliable and comprehensive, significantly improving the system's self-healing ability in abnormal situations and minimizing the impact of failures on system stability and user experience.

[0143] In some of the solutions mentioned above in this application, methods are proposed to realize anomaly detection and automatic rollback after system version update. However, in this process, if there is no standardized computer-readable storage medium to store and execute these methods, it may lead to inconsistent method implementation, deployment difficulties, low execution efficiency, and difficulty in ensuring reliability and maintainability, thereby affecting the stability and scalability of automated detection and rollback.

[0144] In this regard, this application proposes a computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the anomaly detection and automatic rollback method described above.

[0145] The computer-readable storage medium refers to a physical device capable of storing digital data and computer instructions. This medium can be non-volatile memory, such as hard disk drives (HDDs), solid-state drives (SSDs), flash memory (such as USB drives and SD cards), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), or electrically erasable programmable read-only memory (EEPROM). Furthermore, in certain specific application scenarios, a program may be loaded into volatile memory (such as random access memory, RAM) for execution at runtime, but the program itself is usually persistently stored on a non-volatile medium. This medium provides persistent storage space for computer programs, ensuring that the program is not lost after power failure and can be read and executed by the processor at any time. The computer program is a collection of instructions designed to instruct the computer to perform specific tasks or operations. This program can be a compiled program, such as one written in high-level languages like C/C++, Java, or Go and compiled into machine code or bytecode; or an interpreted program, such as one written in scripting languages like Python, JavaScript, or PHP, which is executed line by line by an interpreter at runtime. This computer program encapsulates the logic and steps of the aforementioned anomaly detection and automatic rollback method, enabling its automated execution. When the program is executed by a processor, the processor (e.g., CPU, GPU, DSP, etc.) reads and operates according to the instruction sequence in the computer program to complete the function defined by the program. The processor can execute program instructions sequentially or in parallel through instruction cycles (including fetch, decode, execute, and write-back stages).
In some cases, dedicated processors (such as FPGAs) can also efficiently execute specific types of computational tasks, such as anomaly detection in machine learning models, through their specific architecture and instruction sets. This process transforms abstract method logic into actual computational operations, driving the system to complete a series of automated processes, including anomaly detection, decision-making, and rollback.

[0146] This application defines a computer-readable storage medium to store the computer program implementing the method, ensuring the standardization, deployability, and automated execution of the anomaly detection and automatic rollback method. Specifically, the computer-readable storage medium provides a reliable storage foundation for persistently storing the program, avoiding inconsistencies in execution caused by temporary or manual deployment; storing the computer program allows the method to exist in executable code form, facilitating distribution and updates; the method is implemented when the program is executed by the processor, defining the automatic execution of the above method when the processor runs the program, eliminating the need for manual intervention and improving execution efficiency and reliability. Each feature works synergistically to solve the deployment difficulties and execution inefficiencies caused by the lack of a carrier during the method implementation process. This scheme enables the complex logic and automated processes described in the above method to run in a unified and efficient manner in different environments without manual intervention, thereby significantly improving the timeliness, accuracy, and automation level of anomaly handling after system updates.

[0147] In one specific implementation scenario, within a cloud service deployment, the computer-readable storage medium can be a server's solid-state drive (SSD). This SSD stores a compiled Java application, which is the computer program containing all the logic and instructions for implementing anomaly detection and automatic rollback methods after system version updates. When the server's central processing unit (CPU) starts the Java application, the CPU loads the program code from the SSD into memory and begins executing these instructions. These instructions drive the server to perform a series of operations, such as continuously collecting real-time data of various cloud service performance metrics and comparing it with pre-stored baseline data to detect anomalies using preset rules. Once an anomaly is detected, the program decides whether to perform an automatic rollback based on a preset decision-making strategy, and if necessary, calls the version control module to restore to the previous version and recover relevant data from snapshots.

[0148] Through the above technical solution, this application provides a standardized carrier that allows the anomaly detection and automatic rollback method after a system version update to be deployed and executed in a unified and reliable manner. This improves the consistency of implementation, ease of deployment, and execution efficiency, reduces the need for manual intervention, and thereby enhances the stability and maintainability of automated detection and rollback. The solution ensures that the method logic is translated efficiently and accurately into actual system operations, providing a technical guarantee for the stable operation of the system after updates.

[0149] Other application scenarios. The following simplified embodiments illustrate the application of the present invention in other fields. For their specific implementation, refer to the detailed embodiments described above; the details are not repeated here.

[0150] Simplified Example 1: Anomaly Detection in Autonomous Vehicle Systems. In intelligent vehicles, after the software of the autonomous driving control unit is updated, the system monitors vehicle status (speed, acceleration), sensor data (radar, camera), communication bus load, and other indicators. When an increased bus packet loss rate and abnormal sensor data are detected, the system performs a semi-automatic rollback (requiring driver confirmation) to ensure driving safety.
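A combination rule with driver confirmation of this kind might look like the following Python sketch. The metric keys `bus_packet_loss` and `sensor_fault_count`, the 0.02 packet-loss margin, and both function names are assumptions made for illustration; the example itself only states that both conditions are observed and that the driver must confirm.

```python
from typing import Callable, Dict

Metrics = Dict[str, float]


def combined_anomaly(baseline: Metrics, current: Metrics, loss_margin: float = 0.02) -> bool:
    """Combination rule: abnormal only if BOTH conditions hold.

    loss_margin is an assumed absolute tolerance on the bus packet loss rate.
    """
    bus_degraded = current["bus_packet_loss"] > baseline["bus_packet_loss"] + loss_margin
    sensor_abnormal = current["sensor_fault_count"] > 0  # assumed sensor health signal
    return bus_degraded and sensor_abnormal


def semi_automatic_rollback(
    anomalous: bool,
    driver_confirms: Callable[[], bool],
    rollback: Callable[[], None],
) -> bool:
    """Roll back only when an anomaly exists and the driver confirms."""
    if anomalous and driver_confirms():
        rollback()
        return True
    return False


baseline = {"bus_packet_loss": 0.01, "sensor_fault_count": 0}
current = {"bus_packet_loss": 0.05, "sensor_fault_count": 3}

log = []
anomalous = combined_anomaly(baseline, current)
rolled_back = semi_automatic_rollback(
    anomalous,
    driver_confirms=lambda: True,  # stand-in for a real confirmation prompt
    rollback=lambda: log.append("rollback"),
)
print(anomalous, rolled_back, log)
```

Requiring both signals reduces the chance that a single noisy indicator triggers a rollback prompt while the vehicle is in use.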

[0151] Simplified Example 2: Firmware Upgrade for Smart Home Devices. After a smart speaker's firmware is upgraded, the system monitors metrics such as response latency, wake-up success rate, and user complaint rate. When the wake-up success rate drops by more than 30%, the system automatically rolls back to the previous version and notifies the user.
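The deviation rule in this example can be sketched in Python as below. The sketch assumes that the 30% figure is a relative drop against the baseline rate (the text does not say whether it is relative or in percentage points), and the function names and action labels are hypothetical.

```python
from typing import List


def relative_drop(baseline: float, current: float) -> float:
    """Fractional decrease of current relative to baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - current) / baseline


def check_wake_rate(baseline_rate: float, current_rate: float, max_drop: float = 0.30) -> List[str]:
    """Return the actions to take for one wake-up success rate sample."""
    # Assumption: "drops by more than 30%" means a relative drop above 0.30.
    if relative_drop(baseline_rate, current_rate) > max_drop:
        return ["rollback_firmware", "notify_user"]
    return []


severe = check_wake_rate(baseline_rate=0.95, current_rate=0.60)  # about 37% relative drop
mild = check_wake_rate(baseline_rate=0.95, current_rate=0.70)    # about 26% relative drop
print(severe, mild)
```

If the intended meaning were a drop of 30 percentage points, the comparison would instead be `baseline_rate - current_rate > 0.30`.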

[0152] Simplified Example 3: Anomaly Detection After a Mobile Application Canary (Gray-Scale) Release. After a new version of a mobile application enters the canary release phase, the system monitors metrics such as user crash rate, startup time, and page load speed. If the crash rate exceeds twice the baseline, the system automatically pauses the canary release, rolls back to the previous version, and sends a report to the product manager.
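As a rough Python sketch of this check: the `CanaryStats` type, the `evaluate_canary` function, the action labels, and in particular the `min_sessions` guard are assumptions added for illustration; the example itself only specifies the "twice the baseline" condition and the three resulting actions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CanaryStats:
    sessions: int
    crashes: int

    @property
    def crash_rate(self) -> float:
        return self.crashes / self.sessions if self.sessions else 0.0


def evaluate_canary(
    baseline_rate: float,
    stats: CanaryStats,
    factor: float = 2.0,
    min_sessions: int = 500,
) -> List[str]:
    """Return the ordered actions for the current canary sample."""
    # Assumed guard: skip the check until enough sessions have been observed.
    if stats.sessions < min_sessions:
        return []
    # Rule from the example: crash rate exceeds twice the baseline.
    if stats.crash_rate > factor * baseline_rate:
        return ["pause_canary", "rollback", "send_report"]
    return []


bad = evaluate_canary(0.01, CanaryStats(sessions=1000, crashes=25))     # rate 0.025
fine = evaluate_canary(0.01, CanaryStats(sessions=1000, crashes=15))    # rate 0.015
too_few = evaluate_canary(0.01, CanaryStats(sessions=100, crashes=50))  # below min_sessions
print(bad, fine, too_few)
```

The minimum-session guard is one possible way to keep a handful of early crashes in a small canary cohort from triggering a rollback; it is not part of the described example.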

[0153] The above description is merely an embodiment of this application and is not intended to limit the scope of protection of this application. Various modifications and variations can be made to this application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the scope of protection of this application.

[0154] This solution is applicable not only to local devices but can also be deployed on cloud servers. Those skilled in the art should understand that the technical solution of this invention is not limited to a specific deployment environment, and any implementation based on the technical concept of this application should be considered to fall within the protection scope of this invention.

Claims

1. A method for anomaly detection and automatic rollback after a system version update, characterized in that it includes the following steps:
collecting and storing baseline data representing various operational indicators of the target system before the update;
after the target system is updated to a new version, continuously collecting real-time data of the various operational metrics;
comparing the real-time data with the baseline data and detecting any anomalies using preset rules; and
when an anomaly is detected, determining, based on a preset decision-making strategy, whether to perform an automatic rollback, and, if an automatic rollback is performed, restoring the previous version and recovering the relevant data from the backup.

2. The method according to claim 1, characterized in that the operational metrics include, but are not limited to, at least one of the following: performance metrics, user feedback metrics, system interaction metrics, or business metrics.

3. The method according to claim 2, characterized in that the user feedback metrics include, but are not limited to, at least one of the following: negative feedback rate, user dissatisfaction rate, or interaction interruption rate.

4. The method according to claim 1, characterized in that the preset rules include, but are not limited to, at least one of the following: deviation rules, trend rules, combination rules, or machine learning models.

5. The method according to claim 4, characterized in that the deviation rules include: if an indicator meets a predetermined deviation condition compared with the baseline, it is judged as abnormal.

6. The method according to claim 4, characterized in that the trend rules include: if an indicator continues to deteriorate over multiple consecutive monitoring periods, it is judged as abnormal.

7. The method according to claim 1, characterized in that it also includes a threshold adaptive adjustment step: dynamically adjusting the threshold for anomaly detection based on historical operational data.

8. The method according to claim 1, characterized in that the decision-making strategy includes at least one of automatic rollback, semi-automatic rollback, or disabling rollback.

9. The method according to claim 1, characterized in that restoring the previous version includes invoking a version control module or a firmware rollback mechanism, and recovering from the backup includes restoring from snapshots, images, or configuration backups.

10. A system for anomaly detection and automatic rollback after a system version update, characterized in that it includes:
an indicator acquisition module, used to collect baseline data and real-time data of operational indicators before and after the target system is updated;
a monitoring module, used to continuously monitor the real-time data;
an anomaly detection engine, used to detect whether anomalies exist based on preset rules;
a decision-making module, used to determine whether to perform an automatic rollback based on a preset decision-making strategy; and
a rollback execution module, used to restore the version before the update and restore relevant data from backup.

11. The system according to claim 10, characterized in that it also includes a threshold adaptive module, which is used to dynamically adjust the threshold for anomaly detection based on historical operational data.

12. The system according to claim 10, characterized in that it also includes a user notification module, which is used to send notifications to users when anomalies are detected or rollback decisions are made.

13. The system according to claim 10, characterized in that the rollback execution module is configured to call a version control module to restore the version before the update and to restore the relevant data from a snapshot.

14. A computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the method of any one of claims 1 to 9.