[0054] The technical solutions in the embodiments of the present invention will be clearly and completely described below in conjunction with the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, rather than all the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.
[0055] An embodiment of the present invention discloses a distributed intelligent operation and maintenance platform. As shown in Figure 1, the platform includes:
[0056] The task scheduling module 1 is used to generate the target task by using the task scheduling information input by the user and multiple atomic tasks.
[0057] Specifically, the user inputs task scheduling information through a human-computer interaction device to combine or split multiple atomic tasks stored in advance in the database, thereby generating a new target task. For example, atomic tasks are saved by function, with each atomic task corresponding to one function. The user can input task scheduling information with the mouse, dragging and dropping the atomic tasks corresponding to the required functions in turn to form a new multi-functional target task. This achieves visual task scheduling with strong flexibility and improves the efficiency of task scheduling.
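By way of non-limiting illustration, the combination of atomic tasks into a target task can be sketched in Python as follows. The names `AtomicTask`, `TargetTask`, `build_target_task`, and the sample library `LIBRARY` are hypothetical and do not appear in the embodiments. The sketch only assumes that atomic tasks are stored by function name and are executed in the order the user arranges them.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class AtomicTask:
    """One stored unit of work that implements a single function."""
    name: str
    run: Callable[[dict], dict]


@dataclass
class TargetTask:
    """An ordered combination of atomic tasks produced by scheduling."""
    name: str
    steps: List[AtomicTask] = field(default_factory=list)

    def execute(self, context: dict) -> dict:
        # Run the atomic tasks in the order the user arranged them.
        for step in self.steps:
            context = step.run(context)
        return context


def build_target_task(name: str, ordered_names: List[str],
                      library: Dict[str, AtomicTask]) -> TargetTask:
    """Combine atomic tasks from the library in the user-specified order."""
    missing = [n for n in ordered_names if n not in library]
    if missing:
        raise KeyError(f"unknown atomic tasks: {missing}")
    return TargetTask(name, [library[n] for n in ordered_names])


# Hypothetical atomic task library; each task only marks the context.
LIBRARY = {
    "check_disk": AtomicTask("check_disk", lambda ctx: {**ctx, "disk_checked": True}),
    "clean_logs": AtomicTask("clean_logs", lambda ctx: {**ctx, "logs_cleaned": True}),
    "restart_service": AtomicTask("restart_service", lambda ctx: {**ctx, "restarted": True}),
}

task = build_target_task("disk_recovery", ["check_disk", "clean_logs"], LIBRARY)
result = task.execute({})
```

In this sketch the drag-and-drop order corresponds to the `ordered_names` list, so rearranging the list is enough to produce a different target task from the same atomic tasks.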
[0058] The task allocation module 2 is used to allocate the target task to the target device by using the task allocation information input by the user.
[0059] The task execution module 3 is used to make the target device run the target task by using the task execution information input by the user.
[0060] Specifically, after the target task is obtained, the task allocation information and the task execution information input by the user can be received in turn. The device information recorded in the task allocation information is used to allocate the target task to the specified target device, and the task execution information is used to remotely control the target device through the platform to run the target task.
[0061] It is understandable that the target task can be saved to the target device through network transmission, so that the target device can run the target task. There can be multiple target devices; for example, when the target task needs to run on multiple different devices, there are multiple target devices.
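As a minimal sketch of allocation followed by execution on multiple target devices, the following Python fragment may be considered. The `Device` class and the functions `allocate` and `execute` are hypothetical names, and the network transmission is replaced here by an in-memory call, which is an assumption made only to keep the example self-contained.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Device:
    """In-memory stand-in for a remote target device."""
    device_id: str
    stored_tasks: Dict[str, List[str]] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)

    def receive(self, task_name: str, steps: List[str]) -> None:
        # Stands in for saving the target task via network transmission.
        self.stored_tasks[task_name] = list(steps)

    def run(self, task_name: str) -> None:
        # Run the stored target task step by step and record each step.
        for step in self.stored_tasks[task_name]:
            self.log.append(f"{self.device_id}:{step}")


def allocate(task_name: str, steps: List[str], device_ids: List[str],
             inventory: Dict[str, Device]) -> List[Device]:
    """Send the target task to every device named in the allocation info."""
    targets = [inventory[d] for d in device_ids]
    for device in targets:
        device.receive(task_name, steps)
    return targets


def execute(task_name: str, targets: List[Device]) -> None:
    """Make each target device run the task it was allocated."""
    for device in targets:
        device.run(task_name)


inventory = {d: Device(d) for d in ("host-a", "host-b", "host-c")}
targets = allocate("disk_recovery", ["check_disk", "clean_logs"],
                   ["host-a", "host-c"], inventory)
execute("disk_recovery", targets)
```

Only the devices listed in the allocation information receive and run the task; `host-b` stays untouched in this example.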
[0062] The task monitoring module 4 is used to obtain the operation information of each stage in the running process of the target task for the visualization module 6 to display.
[0063] Specifically, since the target task is composed of multiple atomic tasks, when the target task runs on the target device, different atomic tasks run as subtasks on different nodes of the target device. A subtask can include one atomic task or multiple atomic tasks, and the number of atomic tasks included in a subtask is less than that of the target task. In order to monitor the whole process of the target task and give users a clearer view of task progress, the task monitoring module 4 obtains the operation information of each stage in the running process of the target task, that is, the running information of each subtask of the target task during the running process. The running information can include the node where a subtask is located, how many nodes there are, which subtask is currently being executed, the running progress of the running subtask, and so on. In this way, all the running information of the target task on the target device is obtained, so that the user can fully monitor the target task.
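The running information described above can be illustrated with the following Python sketch. The names `SubtaskStatus` and `TaskMonitor`, and the representation of progress as a fraction between 0.0 and 1.0, are assumptions introduced for illustration and are not part of the embodiments.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SubtaskStatus:
    name: str
    node: str
    progress: float  # assumed scale: 0.0 (not started) to 1.0 (finished)


class TaskMonitor:
    """Collects per-subtask running information for display."""

    def __init__(self) -> None:
        self._status: Dict[str, SubtaskStatus] = {}

    def report(self, name: str, node: str, progress: float) -> None:
        if not 0.0 <= progress <= 1.0:
            raise ValueError("progress must be between 0.0 and 1.0")
        self._status[name] = SubtaskStatus(name, node, progress)

    def snapshot(self) -> dict:
        """Summarize node count, running subtasks, and per-subtask progress."""
        statuses = list(self._status.values())
        return {
            "node_count": len({s.node for s in statuses}),
            "running": [s.name for s in statuses if 0.0 < s.progress < 1.0],
            "subtasks": {s.name: (s.node, s.progress) for s in statuses},
        }


monitor = TaskMonitor()
monitor.report("check_disk", "node-1", 1.0)
monitor.report("clean_logs", "node-2", 0.5)
monitor.report("restart_service", "node-2", 0.0)
view = monitor.snapshot()
```

A visualization layer could render `view` directly, since it already contains the node of each subtask, the number of nodes, the subtask currently executing, and its progress.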
[0064] The task control module 5 is used to control the target subtasks of the target task by using the task control information input by the user.
[0065] Specifically, once all the running information of the target task on the target device has been obtained, the user can exercise control during the running process according to the running status of the target task, for example, adding an interrupt to a subtask, modifying the data of unexecuted subtasks, setting an execution delay time, and so on.
[0066] Of course, the task control information can also be used to reconfigure target tasks that are not running.
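The control operations mentioned above (interrupting a subtask, modifying the data of unexecuted subtasks, and setting an execution delay) can be sketched as follows. The `Subtask` and `TaskController` names and the four state labels are hypothetical; the embodiments do not prescribe a particular state model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Subtask:
    name: str
    params: Dict[str, object] = field(default_factory=dict)
    state: str = "pending"  # assumed states: pending, running, done, interrupted
    delay_seconds: float = 0.0


class TaskController:
    """Applies user-supplied task control information to subtasks."""

    def __init__(self, subtasks: List[Subtask]) -> None:
        self._subtasks = {s.name: s for s in subtasks}

    def interrupt(self, name: str) -> None:
        sub = self._subtasks[name]
        # A finished subtask is left as it is.
        if sub.state in ("pending", "running"):
            sub.state = "interrupted"

    def modify_params(self, name: str, **changes: object) -> None:
        sub = self._subtasks[name]
        # Only the data of unexecuted subtasks may be modified.
        if sub.state != "pending":
            raise RuntimeError(f"{name} has already started")
        sub.params.update(changes)

    def set_delay(self, name: str, seconds: float) -> None:
        if seconds < 0:
            raise ValueError("delay must not be negative")
        self._subtasks[name].delay_seconds = seconds


subtasks = [
    Subtask("check_disk", state="running"),
    Subtask("clean_logs", params={"keep_days": 7}),
    Subtask("restart_service"),
]
controller = TaskController(subtasks)
controller.interrupt("check_disk")
controller.modify_params("clean_logs", keep_days=3)
controller.set_delay("restart_service", 30.0)
```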
[0067] The visualization module 6 is used to display various information to the user.
[0068] Specifically, the visualization module 6 corresponds to a display. Various information, such as the task scheduling process, the task allocation process, and the task running information, can be displayed to the user by the visualization module 6 through the display, which supports visual operation.
[0069] It can be seen that the embodiment of the present invention realizes visual scheduling with the atomic task as the unit, which improves the efficiency of generating the target task. After the target task is allocated to the target device, the operation of the target task at each stage on the target device can be monitored, realizing comprehensive monitoring of the target task during its operation and obtaining all the operation information of the target task over the entire operation period. The user can therefore input task control information through the platform according to the operation information, and the task control information is used to control the target subtasks of the running target task. This realizes full-process monitoring of the target task, improves task management efficiency, and covers a wider range of application scenarios. A flexible task scheduling engine is built, and a scheduling mechanism for complex operation and maintenance scenarios is established. Standardized atomic operation and maintenance operations are combined in an orderly manner and driven by the automation engine, realizing standardized and centralized management of operation and maintenance operations to meet the needs of different operation and maintenance scenarios.
[0070] An embodiment of the present invention discloses a specific distributed intelligent operation and maintenance platform. Compared with the previous embodiment, this embodiment further explains and optimizes the technical solution. As shown in Figure 2, specifically:
[0071] In the embodiment of the present invention, as shown in Figures 3 to 6, the distributed intelligent operation and maintenance platform can also include:
[0072] A configuration management module, used to store equipment resource information and personnel information.
[0073] Further, the configuration management module mainly performs resource object management, resource model management, resource association management, resource index parameter management, and so on, so that operation and maintenance personnel can carry out maintenance work such as adding, modifying, and deleting resource categories, index parameters, and association relationships.
[0074] Management can be carried out according to the operation and maintenance objects involved in the business system, including:
[0075] Resource category management, including: host, database, middleware, business system, network security equipment, etc.; determine resource attributes according to the type of resource object;
[0076] Resource attribute management, including: information system type, information system name, resource type, resource name, and importance level of the resource;
[0077] Resource association management, including: maintenance and management of operating dependencies between resources;
[0078] Function rules can also be implemented: resource configuration is allocated according to user authority. System administrators and administrators have the authority to add resources individually, import them in batches, and export, modify, and delete resources, while ordinary users only have the functions of resource pool creation and resource authority assignment.
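The resource attributes and the authority rules listed above can be sketched in Python as follows. The `Resource` fields mirror the attributes named in the text, while the `Role` enumeration, the operation strings, and the `is_allowed` function are hypothetical. The sketch grants each role only the operations the text lists for it; whether administrators may also perform the ordinary-user functions is not stated and is therefore left out.

```python
from dataclasses import dataclass
from enum import Enum


@dataclass(frozen=True)
class Resource:
    """Resource attributes as listed under resource attribute management."""
    system_type: str
    system_name: str
    resource_type: str  # e.g. host, database, middleware
    resource_name: str
    importance: str


class Role(Enum):
    SYSTEM_ADMIN = "system_admin"
    ADMIN = "admin"
    ORDINARY = "ordinary"


# Hypothetical operation names for the authorities described in the text.
RESOURCE_OPS = frozenset({"add", "batch_import", "export", "modify", "delete"})
ORDINARY_OPS = frozenset({"create_resource_pool", "assign_resource_permission"})

PERMISSIONS = {
    Role.SYSTEM_ADMIN: RESOURCE_OPS,
    Role.ADMIN: RESOURCE_OPS,
    Role.ORDINARY: ORDINARY_OPS,
}


def is_allowed(role: Role, operation: str) -> bool:
    """Return True if the role is granted the operation in this sketch."""
    return operation in PERMISSIONS[role]


db = Resource("billing", "billing-core", "database", "orders-db", "high")
```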
[0079] In the embodiment of the present invention, as shown in Figure 7, the distributed intelligent operation and maintenance platform can also include:
[0080] An intelligent operation and maintenance module, used for automatic fault handling.
[0081] Further, the intelligent operation and maintenance module can process scenario-based operation and maintenance tasks based on actual operation and maintenance scenarios, including fault self-healing. For example: after a disk-space-full alarm is triggered, files are cleaned automatically; after a process service is suspended, the process is restarted automatically and a health check is performed; after the server uptime becomes abnormal, a configuration check is initiated; after performance data exceeds a limit, the top processes are checked and handled; and when a compliance check is performed, deviations are corrected automatically. Root cause diagnosis is also realized: the configuration data, performance data, and log data of IT components are comprehensively analyzed in combination with the architecture topology and the fault analysis model, and fault analysis reports are output through the expert knowledge base.
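The fault self-healing scenarios above amount to a mapping from an alarm type to a handling action, which can be sketched as follows. The alarm type strings, the handler functions, and the fallback that escalates unknown alarms to an operator are all assumptions for illustration; the handlers return descriptions of actions rather than performing them.

```python
from typing import Callable, Dict, List


def clean_files(host: str) -> List[str]:
    return [f"clean files on {host}"]


def restart_and_health_check(host: str) -> List[str]:
    return [f"restart process on {host}", f"run health check on {host}"]


def check_configuration(host: str) -> List[str]:
    return [f"run configuration check on {host}"]


def handle_top_processes(host: str) -> List[str]:
    return [f"inspect top processes on {host}"]


# Hypothetical alarm types mapped to the self-healing actions in the text.
SELF_HEALING_RULES: Dict[str, Callable[[str], List[str]]] = {
    "disk_full": clean_files,
    "process_suspended": restart_and_health_check,
    "uptime_abnormal": check_configuration,
    "performance_exceeded": handle_top_processes,
}


def self_heal(alarm_type: str, host: str) -> List[str]:
    """Return the planned actions for an alarm raised on a host."""
    action = SELF_HEALING_RULES.get(alarm_type)
    if action is None:
        # Assumption: alarms without a rule are handed to a human operator.
        return [f"escalate {alarm_type} on {host} to an operator"]
    return action(host)


plan = self_heal("process_suspended", "app-01")
```

Keeping the rules in a table means a new scenario can be supported by registering one more entry, without changing `self_heal` itself.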
[0082] In the embodiment of the present invention, the distributed intelligent operation and maintenance platform may also include:
[0083] The monitoring management module is used to monitor the status of software and hardware in real time and generate alarm information.
[0084] Specifically, the monitoring management module implements real-time monitoring of software and hardware for operation and maintenance objects, including real-time monitoring of alarm events and performance data of operation and maintenance objects.
[0085] Specifically, the function realization includes:
[0086] A user on a server trusts a user on the client and allows this user to log in without entering a password.
[0087] Agentless monitoring: no agent is required and the existing network environment is not affected, while the various indicators and parameter values of the operation and maintenance objects are monitored.
[0088] Out-of-band management: data such as logs and configuration information are obtained from server equipment by means of APIs, so that the administrator can find abnormal information in advance.
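One simple way the monitoring management module could turn collected performance data into alarm information is a threshold comparison, sketched below. The embodiments do not specify how alarms are derived, so the threshold rule, the `Alarm` class, the `check_metrics` function, and the metric names are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Alarm:
    target: str
    metric: str
    value: float
    threshold: float


def check_metrics(target: str, metrics: Dict[str, float],
                  thresholds: Dict[str, float]) -> List[Alarm]:
    """Raise an alarm for every metric that exceeds its configured limit."""
    alarms = []
    for metric, value in metrics.items():
        limit = thresholds.get(metric)
        # Metrics without a configured limit are collected but never alarm.
        if limit is not None and value > limit:
            alarms.append(Alarm(target, metric, value, limit))
    return alarms


alarms = check_metrics(
    "db-01",
    {"cpu_percent": 97.0, "disk_percent": 40.0, "open_files": 120.0},
    {"cpu_percent": 90.0, "disk_percent": 85.0},
)
```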
[0089] In the embodiment of the present invention, as shown in Figures 8 to 14, the distributed intelligent operation and maintenance platform can also include:
[0090] The job management module is used for job scheduling management of operation and maintenance jobs.
[0091] Specifically, the job management module is mainly used to perform operation and maintenance job scheduling management tasks including: adding operation and maintenance tools, starting and stopping, uploading, downloading, historical version rollback, and parameter configuration.
[0092] The job management module can adopt a distributed timing task framework, which can organize and arrange tasks flexibly and provides good task control and coordination. The distributed architecture achieves high availability, and the job engine schedules without interruption and runs efficiently.
[0093] Among them, the specific functions are as follows:
[0094] Timing task: execute timing tasks based on the cron expression of the mature timing task job framework.
[0095] The job registration center, including the global job registration control center, is used to register, control and coordinate the execution of distributed jobs.
[0096] Flexible expansion and contraction: if a running job server crashes, or n new job servers are added, the job framework re-shards before the next job execution, without affecting the current job execution.
[0097] Multiple job execution modes: three job modes are supported, namely OneOff, Perpetual, and SequencePerpetual.
[0098] Failover: the crash of a running job server does not cause re-sharding; sharding occurs only when the next job starts. With the failover function enabled, the idle state of the other job servers can be monitored during the current job execution, so that they grab the unfinished orphan sharding items for execution.
[0099] Run-time status collection: monitor the running status of the job, count the numbers of successfully and unsuccessfully processed data items in the most recent period, and record the start time, end time, and next run time of the job.
[0100] Re-triggering of missed jobs: missed jobs are automatically recorded and are automatically triggered after the previous job is completed.
[0101] Multithreading to process data quickly: Use multithreading to process the captured data to improve throughput.
[0102] Idempotence: repeated job task items are detected, and job task items that are already running are not executed again. Since enabling idempotence requires monitoring the running status of the job, it has a considerable performance impact on jobs that run repeatedly within a very short time.
[0103] Fault-tolerant processing: if the job server fails to communicate with the ZooKeeper server, the job is stopped immediately to prevent the job registry from assigning the invalid shards to other job servers while the current job server is still executing tasks, which would result in repeated execution.
[0104] Spring support: supports the Spring container, a custom namespace, and placeholders.
[0105] Operation and maintenance platform: provides an operation and maintenance interface for managing jobs and registration centers.
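The sharding behavior described under flexible expansion and contraction and under failover can be illustrated with the following sketch. It is not the code of any particular job framework: the round-robin assignment in `assign_shards` and the use of "fewest assigned items" as a stand-in for "idle" in `take_over_orphans` are assumptions made for illustration.

```python
from typing import Dict, List


def assign_shards(servers: List[str], shard_count: int) -> Dict[str, List[int]]:
    """Distribute sharding items round-robin over the live job servers."""
    if not servers:
        return {}
    ordered = sorted(servers)
    assignment: Dict[str, List[int]] = {s: [] for s in ordered}
    for item in range(shard_count):
        assignment[ordered[item % len(ordered)]].append(item)
    return assignment


def take_over_orphans(assignment: Dict[str, List[int]],
                      crashed: str) -> Dict[str, List[int]]:
    """Failover: surviving servers grab the crashed server's unfinished items."""
    survivors = {s: list(items) for s, items in assignment.items() if s != crashed}
    if not survivors:
        raise RuntimeError("no surviving job server")
    for item in assignment.get(crashed, []):
        # Assumption: the server with the fewest items counts as the idlest.
        idlest = min(sorted(survivors), key=lambda s: len(survivors[s]))
        survivors[idlest].append(item)
    return survivors


before = assign_shards(["a", "b", "c"], 6)
during_job = take_over_orphans(before, "c")  # failover within the current run
next_run = assign_shards(["a", "b"], 6)      # re-sharding before the next run
```

The two functions reflect the distinction drawn in the text: failover redistributes only the orphaned items during the current execution, whereas re-sharding recomputes the whole assignment before the next execution.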
[0106] The job management module can include a tool management unit, which is used to provide a unified visual platform for script management and customization. Operation and maintenance personnel customize daily change tools according to standardized requirements to form a unified change entry. The tool management unit provides addition, deletion, modification, and query of the various operation and maintenance tools.
[0107] The job management module mainly completes the configuration of atomic tasks, creates independent operation and maintenance tools for all operation and maintenance objects entered in the configuration management module, and provides basic functions of operation and maintenance tools for the flexible operation of the operation platform.
[0108] The front-end page provides a window for creating a new atomic task. An atomic task is completed by locking onto the resource name of the operation and maintenance object, matching the function for that resource name, entering the authorized user and password given for the resource object, filling in the IP address and communication port of the resource object, and configuring parameters for the series of functions set in advance for the specific task. In addition, functions for editing and deleting atomic tasks are provided.
[0109] The job management module may also include a job report unit for displaying the results of all jobs created by users on a unified page.
[0110] Specifically, through the establishment of a flexible reporting platform, a unified data output window is provided for operators, and a convenient and fast data query channel is provided at the same time.
[0111] Using the data generated by operation and maintenance operations, the operation and maintenance objects are comprehensively compared, analyzed, and evaluated in horizontal, vertical, normal, abnormal, and other dimensions. Combined with the graphical display of visualized data, this yields a comprehensive summary and analysis of the operating status and performance of the operation and maintenance objects, together with guiding handling suggestions.
[0112] The report function is open to all users. Ordinary users can only add, view, modify, and delete jobs created by themselves, and other ordinary users do not have permission to delete them. Administrators can add, view, modify, and delete reports added by all users in their group. System administrators can add, view, modify, and delete reports added by all users in all groups and departments.
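The scope-based report authority described above can be sketched as a single check. The `User` and `Report` classes, the role strings, and the `can_manage` function are hypothetical, and the sketch assumes that each report records the name and group of the user who added it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    role: str  # assumed values: "ordinary", "admin", "system_admin"
    group: str


@dataclass(frozen=True)
class Report:
    title: str
    owner: str  # name of the user who added the report
    group: str  # group of that user


def can_manage(user: User, report: Report) -> bool:
    """Return True if the user may view, modify, or delete the report."""
    if user.role == "system_admin":
        return True                        # all groups and departments
    if user.role == "admin":
        return report.group == user.group  # every user in the same group
    return report.owner == user.name       # ordinary users: own items only


alice = User("alice", "ordinary", "ops-1")
bob = User("bob", "ordinary", "ops-1")
carol = User("carol", "admin", "ops-1")
dave = User("dave", "admin", "ops-2")
root = User("root", "system_admin", "platform")
weekly = Report("weekly disk usage", owner="alice", group="ops-1")
```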
[0113] As shown in Figure 15, the distributed intelligent operation and maintenance platform can include:
[0114] Application server: Provides functions such as project operating environment, platform web access, and function scheduling.
[0115] Business database server: user management, operating platform, timing tasks and other data access and scheduling.
[0116] Configuration management server: CMDB, monitoring module, self-healing management rules, script tools, persistent storage of uploaded file data.
[0117] Report server: Provides functions such as report running environment, report web template design, and persistent storage of report data.
[0118] Deployment server: Provide Linux bare metal deployment services, including the initial installation of physical servers, management and distribution of system standard images, etc.
[0119] API interface server: Provides API project operating environment and interface access functions.
[0120] Correspondingly, an embodiment of the present invention also discloses a distributed intelligent operation and maintenance method. As shown in Figure 16, the method includes:
[0121] S1: Use the task scheduling information input by the user and multiple atomic tasks to generate the target task;
[0122] S2: Use the task assignment information input by the user to assign the target task to the target device;
[0123] S3: Use the task execution information input by the user to make the target device run the target task;
[0124] S4: Obtain the operation information of each stage in the operation of the target task for display by the visualization module;
[0125] S5: Use the task control information input by the user to control the target subtasks of the target task;
[0126] S6: Display various information to users.
[0127] Among them, after performing any of the steps S1 to S5, S6 can be performed to display information to the user. Therefore, the execution sequence of S6 is not limited here.
[0128] Wherein, the foregoing process of obtaining operation information of each stage in the operation of the target task includes obtaining operation information of each subtask of the target task during the operation.
[0129] Further, the distributed intelligent operation and maintenance method can also include: storing resource information of equipment and applications, as well as information such as personnel authority allocation; automatically handling faults; monitoring the status of software and hardware in real time and generating alarm information; and performing job scheduling management of operation and maintenance jobs.
[0130] In addition, the embodiment of the present invention also discloses a distributed intelligent operation and maintenance device, including:
[0131] Memory, used to store computer programs;
[0132] The processor is used to execute a computer program to implement the aforementioned distributed intelligent operation and maintenance method.
[0133] For the specific content of the distributed intelligent operation and maintenance method, reference may be made to the content disclosed in the foregoing embodiments, and details are not described herein again.
[0134] In addition, the embodiment of the present invention also discloses a computer-readable storage medium, and a computer program is stored on the computer-readable storage medium. When the computer program is executed by a processor, the steps of the aforementioned distributed intelligent operation and maintenance method are realized.
[0135] For the specific content of the distributed intelligent operation and maintenance method, reference may be made to the content disclosed in the foregoing embodiment, and details are not described herein again.
[0136] Finally, it should be noted that in this document, relational terms such as first and second are only used to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or sequence between these entities or operations. Moreover, the terms "include", "comprise", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements, but also other elements not explicitly listed, or elements inherent to the process, method, article, or device. Without further restrictions, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.
[0137] Those skilled in the art can further appreciate that the units and algorithm steps of the examples described in the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of both. In order to clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are executed by hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered as going beyond the scope of the present invention.
[0138] The foregoing describes in detail the distributed intelligent operation and maintenance platform, method, device, and computer-readable storage medium provided by the present invention. Specific examples are used herein to illustrate the principles and implementation of the present invention, and the description of the above embodiments is only intended to help understand the method and core idea of the present invention. At the same time, for those of ordinary skill in the art, there will be changes in the specific implementation and the scope of application according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.