Fault processing method and device, fault processing equipment and storage medium
1. A method of fault handling, the method comprising:
responding to a trigger operation for carrying out fault processing on a target cluster, and acquiring a target troubleshooting path corresponding to the target cluster; the target troubleshooting path comprises N troubleshooting levels, wherein N is an integer larger than 1;
performing troubleshooting on the target cluster based on N troubleshooting levels in the target troubleshooting path to obtain a troubleshooting result of each troubleshooting level corresponding to the target cluster;
determining M abnormal troubleshooting levels according to a troubleshooting result of each troubleshooting level of the target cluster, wherein M is not more than N and is a positive integer;
and determining a strategy corresponding to each abnormal troubleshooting level in the M abnormal troubleshooting levels, and executing fault processing based on the strategy corresponding to each abnormal troubleshooting level.
2. The method of claim 1, wherein one troubleshooting level corresponds to one priority;
the troubleshooting the target cluster based on the N troubleshooting levels in the target troubleshooting path includes:
acquiring the priority of each troubleshooting level in the N troubleshooting levels;
and sequentially carrying out troubleshooting on the target cluster according to the sequence of the priorities from high to low on the basis of the priority of each troubleshooting level in the N troubleshooting levels.
3. The method of claim 1, wherein the performing troubleshooting on the target cluster based on N troubleshooting levels in the target troubleshooting path to obtain a troubleshooting result of each troubleshooting level corresponding to the target cluster comprises:
acquiring log information corresponding to a target troubleshooting level; wherein the target troubleshooting level is any one of the N troubleshooting levels;
and carrying out troubleshooting on the log information corresponding to the target troubleshooting level to obtain a target troubleshooting result of the target troubleshooting level.
4. The method of claim 3, wherein after obtaining the troubleshooting results for each troubleshooting level corresponding to the target cluster, the method further comprises:
generating a fault notification list based on the fault troubleshooting result of each fault troubleshooting level corresponding to the target cluster;
the fault notification list comprises a troubleshooting level identification field item and a troubleshooting result field item; the identification of the target troubleshooting level is stored at any position of the troubleshooting level identification field item, and the target troubleshooting result is stored at the corresponding position in the troubleshooting result field item.
5. The method of claim 1, wherein after determining M abnormal troubleshooting levels from the troubleshooting results for each troubleshooting level of the target cluster, the method further comprises:
and acquiring log information of the M abnormal troubleshooting levels, and displaying the log information corresponding to the M abnormal troubleshooting levels through a log information display interface.
6. The method of claim 1, wherein said determining a strategy corresponding to each abnormal troubleshooting level in the M abnormal troubleshooting levels comprises:
acquiring fault scene information of the target cluster;
and determining a strategy corresponding to each abnormal troubleshooting level according to the fault scene information of the target cluster.
7. The method according to any one of claims 1-6, wherein the determining a strategy corresponding to each abnormal troubleshooting level in the M abnormal troubleshooting levels and executing fault processing based on the strategy corresponding to each abnormal troubleshooting level comprises:
when key fault troubleshooting levels exist in the M abnormal fault troubleshooting levels, acquiring a reference cluster corresponding to the target cluster;
and sending the service data in the target cluster to the reference cluster so that the reference cluster executes the service corresponding to the target cluster.
8. A fault handling apparatus, characterized in that the apparatus comprises:
the acquisition unit is used for responding to a trigger operation for fault processing on a target cluster and acquiring a target troubleshooting path corresponding to the target cluster; the target troubleshooting path comprises N troubleshooting levels, wherein N is an integer larger than 1;
the troubleshooting unit is used for troubleshooting the target cluster based on N troubleshooting levels in the target troubleshooting path to obtain a troubleshooting result of each troubleshooting level corresponding to the target cluster;
the determining unit is used for determining M abnormal troubleshooting levels according to the troubleshooting result of each troubleshooting level of the target cluster, wherein M is not more than N and is a positive integer;
and the fault processing unit is used for determining a strategy corresponding to each abnormal troubleshooting level in the M abnormal troubleshooting levels and executing fault processing based on the strategy corresponding to each abnormal troubleshooting level.
9. A fault handling device comprising a processor, an input device, an output device and a memory, the processor, the input device, the output device and the memory being interconnected, wherein the memory is configured to store a computer program, the computer program comprising a program, the processor being configured to invoke the program to perform the fault handling method according to any one of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the fault handling method according to any one of claims 1-7.
Background
With the popularity of containers and container orchestration technology, more and more enterprises manage their IT systems through Kubernetes (K8S) management platforms. A K8S management platform is mainly used to manage the clusters in an IT system. While the IT system executes services, any cluster in it may fail, and existing solutions generally operate and maintain the clusters managed by the K8S management platform manually. However, because the number of containers in the managed clusters is huge and the operation and maintenance capability of the personnel is limited, manual operation and maintenance is difficult and inefficient, and a faulty cluster cannot be handled quickly and accurately. Therefore, how to quickly and accurately perform fault handling on a faulty cluster is an important research topic.
Disclosure of Invention
The embodiment of the application provides a fault processing method and device, fault processing equipment and a storage medium, wherein the cluster is subjected to troubleshooting analysis through a troubleshooting path, and fault processing can be performed more accurately.
In a first aspect, an embodiment of the present application provides a fault handling method, where the fault handling method includes:
responding to a trigger operation for carrying out fault processing on a target cluster, and acquiring a target troubleshooting path corresponding to the target cluster; the target troubleshooting path comprises N troubleshooting levels, wherein N is an integer larger than 1;
carrying out troubleshooting on the target cluster based on N troubleshooting levels in the target troubleshooting path to obtain a troubleshooting result of each troubleshooting level corresponding to the target cluster;
determining M abnormal troubleshooting levels according to a troubleshooting result of each troubleshooting level of the target cluster, wherein M is not more than N and is a positive integer;
and determining a strategy corresponding to each abnormal troubleshooting level in the M abnormal troubleshooting levels, and executing fault processing based on the strategy corresponding to each abnormal troubleshooting level.
In a second aspect, an embodiment of the present application provides a fault handling apparatus, including:
the acquisition unit is used for responding to a trigger operation for fault processing on a target cluster and acquiring a target troubleshooting path corresponding to the target cluster; the target troubleshooting path comprises N troubleshooting levels, wherein N is an integer larger than 1;
the troubleshooting unit is used for troubleshooting the target cluster based on the N troubleshooting levels in the target troubleshooting path to obtain a troubleshooting result of each troubleshooting level corresponding to the target cluster;
the determining unit is used for determining M abnormal troubleshooting levels according to a troubleshooting result of each troubleshooting level of the target cluster, wherein M is not more than N and is a positive integer;
the fault processing unit is used for determining a strategy corresponding to each abnormal troubleshooting level in the M abnormal troubleshooting levels and executing fault processing based on the strategy corresponding to each abnormal troubleshooting level.
In a third aspect, the present application provides a fault handling device, where the fault handling device includes a processor, an input device, an output device, and a memory, where the processor, the input device, the output device, and the memory are connected to each other, where the memory is used to store a computer program, and the computer program includes a program, and the processor is configured to call the program to execute the fault handling method according to the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium storing a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the fault handling method of the first aspect.
In the embodiment of the application, the fault processing device may respond to a trigger operation for performing fault processing on a target cluster and acquire a target troubleshooting path corresponding to the target cluster, where the target troubleshooting path includes N troubleshooting levels and N is an integer greater than 1. The device performs troubleshooting on the target cluster based on the N troubleshooting levels in the target troubleshooting path to obtain a troubleshooting result of each troubleshooting level corresponding to the target cluster. It then determines M abnormal troubleshooting levels according to the troubleshooting result of each troubleshooting level of the target cluster, where M is not more than N and is a positive integer, determines a strategy corresponding to each abnormal troubleshooting level in the M abnormal troubleshooting levels, and executes fault processing based on the strategy corresponding to each abnormal troubleshooting level. Because the fault processing device can troubleshoot the target cluster directly according to the target troubleshooting path without manual participation, human resources are saved and fault processing efficiency is improved; troubleshooting is also not limited by the operation and maintenance capability of the operation and maintenance user, so faults can be located quickly and accurately. In addition, because a strategy is preset for each abnormal troubleshooting level, fault processing can be executed directly based on the strategy corresponding to the abnormal troubleshooting level. The fault processing itself is therefore also not limited by the operation and maintenance capability of the operation and maintenance user and can be executed quickly and accurately, which further improves the efficiency and accuracy of fault processing.
Drawings
To illustrate the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic architecture diagram of a K8S management platform according to an embodiment of the present disclosure;
Fig. 2 is a schematic flowchart of a fault handling method according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a blockchain according to an embodiment of the present invention;
Fig. 4 is a schematic flowchart of another fault handling method provided in the embodiment of the present application;
Fig. 5 is a schematic structural diagram of a fault handling apparatus according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram of a fault handling device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
With the popularity of containers and container orchestration technologies, more and more enterprises manage their IT systems through a K8S management platform to support rapid business development. Over time, the number of clusters managed by the K8S management platform grows and the number of managed containers increases sharply, which makes traditional operation and maintenance work more complex and affects the stability of the IT system. A high-performance K8S management platform is therefore needed to manage these large numbers of containers and clusters and reduce operational risk. For this purpose, K8S management platforms such as Kubernetes Dashboard, Kuboard, KubeSphere, Rancher, and Lens have been introduced in succession in the industry. However, the existing K8S management platforms cover only the basic management functions of K8S, such as building clusters and managing namespaces, pods, and the like. To ensure the stability of the IT system, the K8S management platform needs fault diagnosis and fault processing functions in addition to basic management functions.
Based on this, the embodiment of the application provides a fault processing method, a fault processing device, and a storage medium. In the fault processing method, a fault processing device responds to a trigger operation for performing fault processing on a target cluster and obtains a target troubleshooting path corresponding to the target cluster, where the target troubleshooting path includes N troubleshooting levels and N is an integer greater than 1. The device performs troubleshooting on the target cluster based on the N troubleshooting levels in the target troubleshooting path to obtain a troubleshooting result of each troubleshooting level corresponding to the target cluster. It then determines M abnormal troubleshooting levels according to the troubleshooting result of each troubleshooting level of the target cluster, where M is not more than N and is a positive integer, determines a strategy corresponding to each abnormal troubleshooting level in the M abnormal troubleshooting levels, and executes fault processing based on the strategy corresponding to each abnormal troubleshooting level. This saves human resources and improves fault processing efficiency; in addition, the method is not limited by the operation and maintenance capability of the operation and maintenance user and can handle the fault of a faulty cluster quickly and accurately.
In one embodiment, the fault handling method may be used for fault handling of a faulty cluster. The fault handling method may be applied to the K8S management platform shown in fig. 1. As shown in fig. 1, the K8S management platform may include at least a fault processing device 11 and a cluster 12. The fault processing device 11 may be any device having data processing capability. The fault processing device 11 may be a server as shown in fig. 1, where the server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), and a big data and artificial intelligence platform. The fault processing device 11 may also be a terminal device, which may include but is not limited to: smart phones, tablets, laptops, wearable devices, desktop computers, and the like. A cluster 12 may refer to a collection of servers (or container servers). The cluster 12 may include one or more servers; for example, the cluster 12 may be the collection of servers 12a, 12b, and 12c shown in fig. 1.
Based on the above description, the fault handling method of the embodiment of the present application is set forth in detail below. Referring to fig. 2, fig. 2 illustrates a fault handling method. As shown in fig. 2, the method includes S201-S204:
S201: In response to a trigger operation for performing fault processing on a target cluster, acquire a target troubleshooting path corresponding to the target cluster, where the target troubleshooting path includes N troubleshooting levels and N is an integer greater than 1.
The target cluster may be any cluster that establishes a communication connection with the failure processing device, and the target cluster may include one or more servers.
In one embodiment, when a server in a target cluster fails, the failure processing device may determine that a trigger operation for performing failure processing on the target cluster is detected when an alarm message from the target cluster is detected.
When the server in the target cluster fails, the target cluster performs alarm reminding. When the target cluster performs alarm reminding, an alarm message is sent to a responsible person (i.e. a person who performs maintenance on the target cluster), and then the fault handling device may determine to detect the trigger operation of performing fault handling on the target cluster when determining that the target cluster sends the alarm message to the responsible person.
One cluster may correspond to one troubleshooting path, and the troubleshooting path may include a plurality of troubleshooting levels. For example, a troubleshooting path may include 7 troubleshooting levels, where the 7 troubleshooting levels are: the health degree of a core component (MASTER), the load condition of the core component, the overall load condition of a machine NODE (NODE), the load condition of each machine NODE, the number of DNS errors of each machine NODE, the number of INGRESS errors of each machine NODE, and the label of each machine NODE. Note that, the machine node according to the embodiment of the present application may be a server.
The fault processing device may obtain a target troubleshooting path corresponding to the target cluster. In one embodiment, all clusters share the same troubleshooting path, in which case the fault processing device may directly use that path as the target troubleshooting path of the target cluster. In another embodiment, the troubleshooting paths of the clusters are not all identical: for example, one troubleshooting path may correspond to several clusters, or one troubleshooting path may correspond to a single cluster. In that case, the fault processing device may determine the target troubleshooting path according to the target cluster. The target troubleshooting path may include N troubleshooting levels, where N is an integer greater than 1.
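The lookup described above could be sketched as follows. This is a minimal illustration, not the embodiment's actual implementation: the cluster identifiers, the shortened level names, and the `CLUSTER_PATHS` override table are all hypothetical.

```python
from typing import Dict, List

# Hypothetical shared path: short names for the seven example levels.
DEFAULT_PATH: List[str] = [
    "core component health",
    "core component load",
    "overall machine node load",
    "per-node load",
    "per-node DNS error count",
    "per-node INGRESS error count",
    "per-node label",
]

# Hypothetical per-cluster overrides; clusters not listed share DEFAULT_PATH.
CLUSTER_PATHS: Dict[str, List[str]] = {
    "cluster-b": ["core component health", "per-node load"],
}


def get_target_path(cluster_id: str) -> List[str]:
    """Return the troubleshooting path for a cluster (N levels, N > 1)."""
    path = CLUSTER_PATHS.get(cluster_id, DEFAULT_PATH)
    if len(path) <= 1:
        raise ValueError("a troubleshooting path needs more than one level")
    return path
```

With this layout, a cluster without its own entry falls back to the shared path, which covers both embodiments (one path for all clusters, or distinct paths for some clusters).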
S202: Perform troubleshooting on the target cluster based on the N troubleshooting levels in the target troubleshooting path to obtain a troubleshooting result of each troubleshooting level corresponding to the target cluster.
In an embodiment, one troubleshooting level may correspond to one priority, and the failure processing device may perform failure troubleshooting on the target cluster in order from high to low in priority based on the priority of each troubleshooting level in the N troubleshooting levels.
For example, continuing the example of S201, the 7 troubleshooting levels are, in order from high priority to low priority: the health degree of the core component, the load condition of the core component, the overall load condition of the machine nodes, the load condition of each machine node, the number of DNS errors of each machine node, the number of INGRESS errors of each machine node, and the label of each machine node. The fault processing device then performs the following steps in sequence, each yielding the troubleshooting result of the corresponding level: (1) troubleshoot the health degree of the core component; (2) troubleshoot the load condition of the core component; (3) troubleshoot the overall load condition of the machine nodes; (4) troubleshoot the load condition of each machine node; (5) troubleshoot the number of DNS errors of each machine node; (6) troubleshoot the number of INGRESS errors of each machine node; (7) troubleshoot the label of each machine node.
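The priority-ordered traversal can be sketched as below. The numeric priorities, the shortened level names, and the stub check functions are assumptions made for illustration; the embodiment does not specify how priorities are encoded.

```python
from typing import Callable, Dict, List, Tuple

# (priority, level name, check function); a lower number means a higher priority.
Level = Tuple[int, str, Callable[[], str]]


def troubleshoot(levels: List[Level]) -> Dict[str, str]:
    """Run every level's check from highest to lowest priority."""
    results: Dict[str, str] = {}
    for _, name, check in sorted(levels, key=lambda lv: lv[0]):
        results[name] = check()
    return results


# Levels deliberately listed out of order; the checks are stubs.
levels: List[Level] = [
    (2, "core component load", lambda: "normal"),
    (1, "core component health", lambda: "abnormal"),
    (5, "per-node DNS error count", lambda: "normal"),
]
results = troubleshoot(levels)
```

Because Python dictionaries preserve insertion order, `results` also records the order in which the levels were checked.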
In an embodiment, the fault processing device may troubleshoot the log information corresponding to a troubleshooting level to obtain the troubleshooting result of that level. For any one of the N troubleshooting levels (namely, a target troubleshooting level), the fault processing device may obtain the log information corresponding to the target troubleshooting level and troubleshoot that log information to obtain a target troubleshooting result of the target troubleshooting level. Optionally, the embodiment of the application may also use machine learning, an artificial intelligence technology, to troubleshoot a target troubleshooting level intelligently. Specifically, the fault processing device may construct N troubleshooting models through machine learning, one per troubleshooting level. When it needs to troubleshoot a target troubleshooting level, it obtains the log information corresponding to that level and calls the troubleshooting model corresponding to that level to analyze the log information, thereby obtaining the target troubleshooting result. Determining the target troubleshooting result through the corresponding troubleshooting model can reduce the error rate of troubleshooting.
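The per-level dispatch can be sketched as follows. The text does not describe the models themselves, so a trivial keyword rule stands in for a trained model here; the level names, keywords, and log lines are hypothetical.

```python
from typing import Callable, Dict, List

# A "model" here is any callable that maps log lines to a result string.
Model = Callable[[List[str]], str]


def keyword_model(keyword: str) -> Model:
    """Build a stand-in model: 'abnormal' if any log line contains keyword."""
    def model(log_lines: List[str]) -> str:
        return "abnormal" if any(keyword in line for line in log_lines) else "normal"
    return model


# One model per troubleshooting level (hypothetical names and keywords).
MODELS: Dict[str, Model] = {
    "core component health": keyword_model("ERROR"),
    "per-node DNS error count": keyword_model("dns lookup failed"),
}


def troubleshoot_level(level: str, log_lines: List[str]) -> str:
    """Analyze a level's log information with the model registered for it."""
    return MODELS[level](log_lines)
```

Keeping the model behind a plain callable means a learned classifier could replace the keyword rule without changing `troubleshoot_level`.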
One troubleshooting level may correspond to multiple candidate troubleshooting categories. Through troubleshooting, the fault processing device may determine a target troubleshooting category, that is, the troubleshooting result, from the candidate troubleshooting categories of the troubleshooting level. The number of candidate troubleshooting categories may be two or more.
The number of candidate troubleshooting categories may be exactly two. For example, the health degree of the core component may correspond to two candidate troubleshooting categories, "normal" and "abnormal", and the fault processing device may determine the target troubleshooting category of the troubleshooting level "health degree of the core component" from these two candidates.
For another example, the DNS error count of each machine node may correspond to two candidate troubleshooting categories, "normal" and "abnormal", and the fault processing device may determine the target troubleshooting category of the troubleshooting level "DNS error count of each machine node" from these two candidates. The target troubleshooting category of this level may depend on the number of machine nodes whose DNS error count is abnormal. For example, when the target cluster includes 2 servers (which may also be referred to as machine nodes), if the number of servers with an abnormal DNS error count is greater than or equal to 1, the target troubleshooting category of the level "DNS error count of each machine node" is "abnormal". A server has an abnormal DNS error count when its DNS error count is greater than a certain threshold.
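The two-category rule above can be written directly as code. This is a small sketch; the server names and the threshold value used in the usage note are illustrative assumptions.

```python
from typing import Dict


def dns_level_result(error_counts: Dict[str, int], threshold: int) -> str:
    """Return 'abnormal' if at least one server's DNS error count exceeds threshold."""
    abnormal_servers = [s for s, n in error_counts.items() if n > threshold]
    return "abnormal" if len(abnormal_servers) >= 1 else "normal"
```

For instance, with a threshold of 2, the counts `{"server1": 0, "server2": 5}` give "abnormal" because one server exceeds the threshold, while `{"server1": 2, "server2": 1}` give "normal".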
The number of candidate troubleshooting categories may also be more than two. For example, the target troubleshooting category of the troubleshooting level "DNS error count of each machine node" may depend on the number of machine nodes whose DNS error count is abnormal. When the target cluster includes 2 servers, the candidate troubleshooting categories of this level may include "0", "1", and "2": if the number of servers with an abnormal DNS error count is 0, the target troubleshooting category is "0"; if it is 1, the target troubleshooting category is "1"; and if it is 2, the target troubleshooting category is "2". As another example, the target troubleshooting category of the level "DNS error count of each machine node" may depend on the DNS error count of each individual server. When the target cluster includes 2 servers, server 1 and server 2, the DNS error count of each server may be either "less than or equal to 2" or "greater than 2". The candidate troubleshooting categories of the level may then include: "the DNS error count of server 1 is less than or equal to 2 and that of server 2 is less than or equal to 2", "the DNS error count of server 1 is less than or equal to 2 and that of server 2 is greater than 2", "the DNS error count of server 1 is greater than 2 and that of server 2 is less than or equal to 2", and "the DNS error count of server 1 is greater than 2 and that of server 2 is greater than 2". The same applies to clusters with more servers.
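Both multi-category schemes can be sketched as below. The threshold of 2 comes from the example above; the label strings `"<=2"` and `">2"` and the server names are shorthand chosen for this sketch.

```python
from itertools import product
from typing import Dict, List, Tuple

THRESHOLD = 2  # "less than or equal to 2" versus "greater than 2"


def count_category(error_counts: Dict[str, int]) -> str:
    """Category = number of servers whose DNS error count exceeds THRESHOLD."""
    return str(sum(1 for n in error_counts.values() if n > THRESHOLD))


def per_server_category(error_counts: Dict[str, int]) -> Tuple[str, ...]:
    """Category = one '<=2' or '>2' label per server, in sorted server order."""
    return tuple(
        ">2" if error_counts[s] > THRESHOLD else "<=2" for s in sorted(error_counts)
    )


def candidate_categories(num_servers: int) -> List[Tuple[str, ...]]:
    """Enumerate every per-server category: 2 ** num_servers combinations."""
    return list(product(("<=2", ">2"), repeat=num_servers))
```

For 2 servers, `candidate_categories(2)` yields the four combinations listed above; the count of candidates doubles with each additional server, whereas the first scheme grows only linearly.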
In an embodiment, once the fault processing device has obtained the troubleshooting results corresponding to the N troubleshooting levels, it may generate a fault notification list based on those results, so that a notification message can be generated from the fault notification list to remind the responsible person. Specifically, the fault notification list may include a troubleshooting level identification field entry and a troubleshooting result field entry. The troubleshooting level identification field entry stores the identification of a troubleshooting level and may be labeled "troubleshooting level name", as shown in the first row and first column of Table 1; the troubleshooting result field entry stores the troubleshooting result corresponding to the troubleshooting level and may be labeled "troubleshooting result", as shown in the first row and second column of Table 1.
The identification of the target troubleshooting level is stored at any position of the troubleshooting level identification field entry, and the target troubleshooting result is stored at the corresponding position of the troubleshooting result field entry. As shown in Table 1, the name of the troubleshooting level "health degree of core component" is stored in the first column of the second row, and the troubleshooting result of that level, "normal", is stored in the second column of the second row.
TABLE 1
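The two field entries described above can be modeled as two parallel lists, where a level's identification and its result share the same index. This is only a sketch of one possible layout; apart from the "health degree of core component" / "normal" pair mentioned in the text, the values are illustrative.

```python
from typing import Dict, List


def build_notification_list(results: Dict[str, str]) -> Dict[str, List[str]]:
    """Store each level's identification and result at the same index of two fields."""
    notification: Dict[str, List[str]] = {
        "troubleshooting level name": [],
        "troubleshooting result": [],
    }
    for level, result in results.items():
        notification["troubleshooting level name"].append(level)
        notification["troubleshooting result"].append(result)
    return notification


notification = build_notification_list({
    "health degree of core component": "normal",
    "load condition of core component": "abnormal",  # illustrative value only
})
```

Looking up a result then means finding the index of the level name in the first field and reading the same index in the second field.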
In one embodiment, for data security and so that a user can conveniently view it, the fault processing device may also write the fault notification list corresponding to the target cluster into a blockchain after generating it.
A blockchain is a chained data structure in which data blocks are connected sequentially in chronological order, and a distributed ledger that uses cryptography to ensure that data cannot be tampered with or forged. Multiple independent distributed nodes maintain the same record. Blockchain technology achieves decentralization and serves as a foundation for trusted storage, transfer, and transaction of digital assets.
Taking the blockchain structure shown in fig. 3 as an example, when new data needs to be written into the blockchain, the data is collected into a block and appended to the end of the existing chain, and a consensus algorithm ensures that the newly added block is identical on every node. Each block records several pieces of data (here, fault notification lists) together with the hash value of the previous block; every block stores the hash value of its predecessor in this way, and the blocks are connected in sequence to form the blockchain. Because the hash value of the previous block is stored in the block header of the next block, any change to the fault notification list in a block also changes that block's hash value. A fault notification list uploaded to the blockchain network is therefore difficult to tamper with, which improves the reliability of the data.
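The hash-linking property can be demonstrated with a few lines of code. This sketch covers only the chaining of hashes on a single node; consensus and distribution across nodes are omitted, and the block layout (`prev_hash`, `data`) and the use of SHA-256 over JSON are assumptions of the sketch.

```python
import hashlib
import json
from typing import Any, Dict, List


def block_hash(block: Dict[str, Any]) -> str:
    """SHA-256 over a canonical JSON encoding of the block."""
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()


def append_block(chain: List[Dict[str, Any]], data: Any) -> None:
    """Append a block that stores the previous block's hash."""
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"prev_hash": prev, "data": data})


def is_valid(chain: List[Dict[str, Any]]) -> bool:
    """Check that every block's prev_hash matches its predecessor's hash."""
    return all(
        chain[i]["prev_hash"] == block_hash(chain[i - 1]) for i in range(1, len(chain))
    )


chain: List[Dict[str, Any]] = []
append_block(chain, {"core component health": "normal"})
append_block(chain, {"core component health": "abnormal"})
```

If the notification data in the first block is modified after the second block has been appended, the first block's hash no longer matches the `prev_hash` stored in the second block, and `is_valid` returns `False`.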
S203: Determine M abnormal troubleshooting levels according to the troubleshooting result of each troubleshooting level of the target cluster, where M is not more than N and M is a positive integer.
Specifically, when the target troubleshooting category of a troubleshooting level is the preset troubleshooting category, the fault processing device may determine that the troubleshooting level is an abnormal troubleshooting level. For example, the preset troubleshooting category of the troubleshooting level "health degree of core component" is "abnormal"; when the target troubleshooting category corresponding to this troubleshooting level is "abnormal", the fault processing device may determine that the troubleshooting level "health degree of core component" is an abnormal troubleshooting level.
The fault processing equipment can sequentially determine M abnormal fault troubleshooting levels according to the fault troubleshooting result of each fault troubleshooting level of the target cluster, wherein M is not more than N, and M is a positive integer.
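As a rough sketch of step S203, the comparison against the preset troubleshooting category could look like the following. The function name `find_abnormal_levels` and the dictionary layout are hypothetical; the level names are taken from the examples in this description.

```python
def find_abnormal_levels(results: dict, preset_categories: dict) -> list:
    """Return the levels whose result category equals the preset category."""
    return [
        level
        for level, category in results.items()
        if category == preset_categories[level]
    ]


# Troubleshooting result (category) of each of the N = 2 levels.
results = {
    "health degree of core component": "abnormal",
    "number of INGRESS errors of each machine node": "normal",
}
# Preset troubleshooting category that marks each level as abnormal.
preset_categories = {level: "abnormal" for level in results}

abnormal_levels = find_abnormal_levels(results, preset_categories)
m = len(abnormal_levels)
```

Here M is simply the length of the returned list, so it can never exceed N. The sketch returns an empty list when no level is abnormal; how that case is handled is outside what the description specifies.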
Further, in order to find the root cause of an abnormal troubleshooting level, the fault processing device may obtain log information corresponding to the abnormal troubleshooting level. Optionally, the fault processing device may display the log information corresponding to the abnormal troubleshooting level on a log information display interface, so that the person in charge can analyze it based on experience. Optionally, the fault processing device may also send the log information to a log analysis model to further analyze the fault cause of the abnormal troubleshooting level.
For example, if the troubleshooting level "health degree of core component" is an abnormal troubleshooting level, the fault processing device may obtain log information corresponding to the health degree of the core component, and the log information may include one or more of the following: ERROR logs, system logs, K8S EVENT information, POD monitoring information, host monitoring index information, version change records, and the like.
S204: Determine a strategy corresponding to each abnormal troubleshooting level in the M abnormal troubleshooting levels, and execute fault processing based on the strategy corresponding to each abnormal troubleshooting level.
Strategies corresponding to the abnormal troubleshooting levels are pre-stored in the fault processing device, so the fault processing device can determine the strategy corresponding to each abnormal troubleshooting level in the M abnormal troubleshooting levels and execute fault processing based on that strategy.
In this embodiment of the present application, the fault processing device may, in response to a trigger operation for performing fault processing on a target cluster, obtain a target troubleshooting path corresponding to the target cluster, where the target troubleshooting path includes N troubleshooting levels. The fault processing device performs troubleshooting on the target cluster based on the N troubleshooting levels in the target troubleshooting path to obtain a troubleshooting result of each troubleshooting level corresponding to the target cluster, determines M abnormal troubleshooting levels according to those results, determines a strategy corresponding to each of the M abnormal troubleshooting levels, and executes fault processing based on the strategy corresponding to each abnormal troubleshooting level. Because the fault processing device troubleshoots the target cluster directly along the target troubleshooting path without manual participation, human resources can be saved and fault processing efficiency improved; fault location is also not limited by the operation and maintenance capability of the operation and maintenance user, so faults can be located quickly and accurately. In addition, because a strategy is preset for each abnormal troubleshooting level, fault processing can be executed directly based on that strategy. The fault processing itself is therefore likewise not limited by the operation and maintenance capability of the operation and maintenance user and can be executed quickly and accurately, which further improves the efficiency and accuracy of fault processing.
As can be seen from the above description of the embodiment of the method shown in fig. 2, the fault handling method shown in fig. 2 may perform fault handling according to the policy corresponding to the abnormal troubleshooting level. Based on this, when the strategy corresponding to the abnormal troubleshooting level needs to be determined, in order to perform fault processing more accurately, the strategy corresponding to the abnormal troubleshooting level can be determined according to the fault scene information of the target cluster, and fault processing is performed based on the strategy. Referring to fig. 4, an embodiment of the present application further provides a fault handling method, where the fault handling method includes S401 to S403:
S401: Acquire the fault scenario information of the target cluster.
The fault scenario information may include alarm information of a service corresponding to the target cluster, for example, a 502 error occurring in a service of the target cluster, or an unknown host error occurring in a service of the target cluster, and so on.
S402: Determine a strategy corresponding to each abnormal troubleshooting level according to the fault scenario information of the target cluster.
For one abnormal troubleshooting level, different fault scenario information may correspond to different strategies. For example, when the troubleshooting level "number of INGRESS errors of each machine node" is an abnormal troubleshooting level and a 502 error occurs in a service corresponding to the target cluster, the fault processing device may determine that the strategy corresponding to the abnormal troubleshooting level is as follows: restart the ingress on each machine node; if the target cluster still fails, perform a monitoring check on the ingress on each machine node, check whether any ingress is abnormal, and shut down the machine nodes with abnormal ingress.
If an unknown host error occurs in a service corresponding to the target cluster, the fault processing device may determine that the strategy corresponding to the abnormal troubleshooting level is as follows: restart the coredns on each machine node in turn; if the target cluster still fails, restart the kube-proxy on each machine node; if the target cluster still fails, perform a monitoring check on the ingress on each machine node, check whether any ingress is abnormal, and shut down the machine nodes with abnormal ingress.
S403: Execute fault processing according to the strategy corresponding to the abnormal troubleshooting level.
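The scenario-dependent strategies for the 502 and unknown-host cases can be sketched as an ordered list of steps that stops as soon as the cluster recovers. The table `POLICIES`, the keys `"ingress_errors"`, `"502"`, and `"unknown_host"`, and the callbacks `execute_step` and `still_failing` are hypothetical names introduced only for this illustration; a real system would replace the step strings with actual operations.

```python
# Hypothetical strategy table keyed by (abnormal level, fault scenario).
POLICIES = {
    ("ingress_errors", "502"): [
        "restart ingress on each machine node",
        "check ingress and shut down nodes with abnormal ingress",
    ],
    ("ingress_errors", "unknown_host"): [
        "restart coredns on each machine node in turn",
        "restart kube-proxy on each machine node",
        "check ingress and shut down nodes with abnormal ingress",
    ],
}


def run_policy(level, scenario, execute_step, still_failing):
    """Run the strategy steps in order, stopping once the cluster recovers."""
    executed = []
    for step in POLICIES[(level, scenario)]:
        execute_step(step)
        executed.append(step)
        if not still_failing():
            break
    return executed


# Stubbed example: the cluster is still failing after the first step
# and has recovered after the second one.
health = iter([True, False])
executed = run_policy(
    "ingress_errors",
    "unknown_host",
    execute_step=lambda step: None,
    still_failing=lambda: next(health),
)
```

With these stubs, only the coredns and kube-proxy restarts are executed; the ingress check is reached only if the cluster is still failing after both.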
In some embodiments, it may also be necessary to transfer the service data in the target cluster to the reference cluster corresponding to the target cluster, so that the reference cluster can continue executing the service.
Optionally, when the target cluster fails (that is, a server in the target cluster fails or a machine node in the target cluster fails), the failure processing device may directly obtain a reference cluster corresponding to the target cluster, and send the service data in the target cluster to the reference cluster, so that the reference cluster executes the service corresponding to the target cluster.
Optionally, when a key troubleshooting level exists in M abnormal troubleshooting levels in the target cluster, the fault processing device may obtain a reference cluster corresponding to the target cluster, and send service data in the target cluster to the reference cluster, so that the reference cluster executes a service corresponding to the target cluster. For example, since the troubleshooting level "the health degree of the core component" is a key troubleshooting level of the target cluster, the service impact on the target cluster is large, and when the troubleshooting level "the health degree of the core component" is an abnormal troubleshooting level, the fault processing device needs to send the service data in the target cluster to the reference cluster, so that the reference cluster executes the service corresponding to the target cluster.
Optionally, whether to perform data transfer may also be determined according to the number M of the abnormal troubleshooting levels. When M is greater than the preset value, most of the N troubleshooting levels have an abnormality, and then the failure processing device may send the service data in the target cluster to the reference cluster, so that the reference cluster executes the service corresponding to the target cluster.
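The three optional conditions for transferring service data to the reference cluster can be summarized in a short sketch. The description presents them as alternatives; combining them with a logical OR in one function, as well as the name `should_transfer` and its parameters, is an assumption made for illustration.

```python
def should_transfer(cluster_failed, abnormal_levels, key_levels, preset_value):
    """Decide whether service data should be sent to the reference cluster."""
    # Condition 1: a server or machine node in the target cluster has failed.
    if cluster_failed:
        return True
    # Condition 2: at least one abnormal level is a key troubleshooting level.
    if any(level in key_levels for level in abnormal_levels):
        return True
    # Condition 3: the number M of abnormal levels exceeds the preset value.
    return len(abnormal_levels) > preset_value


key_levels = {"health degree of core component"}
decision = should_transfer(
    cluster_failed=False,
    abnormal_levels=["health degree of core component"],
    key_levels=key_levels,
    preset_value=3,
)
```

In this example the decision is to transfer, because the only abnormal level is a key troubleshooting level even though M does not exceed the preset value.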
In the embodiment of the application, when the strategy corresponding to an abnormal troubleshooting level needs to be determined, the strategy can be determined according to the fault scenario information of the target cluster, and fault processing is executed based on that strategy. Because the fault processing device takes the fault scenario information of the target cluster into account, it can determine the strategy corresponding to the abnormal troubleshooting level more accurately and therefore execute fault processing more accurately.
Based on the description of the foregoing fault handling method embodiment, the present application further discloses a fault handling apparatus, which may be a computer program (including program code) running in the foregoing fault handling device. The fault handling apparatus may perform the method shown in fig. 2 or fig. 4. Referring to fig. 5, the fault handling apparatus may run the following units:
an obtaining unit 501, configured to obtain a target troubleshooting path corresponding to a target cluster in response to a trigger operation for performing fault processing on the target cluster; the target troubleshooting path comprises N troubleshooting levels, wherein N is an integer larger than 1;
a troubleshooting unit 502, configured to perform troubleshooting on the target cluster based on the N troubleshooting levels in the target troubleshooting path to obtain a troubleshooting result of each troubleshooting level corresponding to the target cluster;
a determining unit 503, configured to determine M abnormal troubleshooting levels according to a troubleshooting result of each troubleshooting level of the target cluster, where M is not greater than N, and M is a positive integer;
a fault processing unit 504, configured to determine a strategy corresponding to each abnormal troubleshooting level in the M abnormal troubleshooting levels, and execute fault processing based on the strategy corresponding to each abnormal troubleshooting level.
In some possible embodiments, one troubleshooting level corresponds to one priority;
the troubleshooting unit 502 performs troubleshooting on the target cluster based on N troubleshooting levels in the target troubleshooting path, including:
acquiring the priority of each troubleshooting level in the N troubleshooting levels;
and sequentially carrying out troubleshooting on the target cluster according to the sequence of the priorities from high to low based on the priority of each troubleshooting level in the N troubleshooting levels.
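The priority-ordered traversal can be sketched as follows. The `Level` dataclass, the numeric priority values, the convention that a larger number means a higher priority, and the `check` callback are all assumptions for illustration; the description only states that each troubleshooting level has one priority and that levels are checked from high to low.

```python
from dataclasses import dataclass


@dataclass
class Level:
    name: str
    priority: int  # assumed convention: larger value means higher priority


def troubleshoot_in_order(levels, check):
    """Check each level from highest to lowest priority and collect results."""
    results = {}
    for level in sorted(levels, key=lambda lv: lv.priority, reverse=True):
        results[level.name] = check(level)
    return results


levels = [
    Level("number of INGRESS errors of each machine node", priority=1),
    Level("health degree of core component", priority=2),
]
# Stub check that reports every level as normal.
results = troubleshoot_in_order(levels, check=lambda level: "normal")
order = list(results)
```

Because Python dictionaries preserve insertion order, `order` lists the levels in the sequence in which they were checked, with the higher-priority level first.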
In some possible embodiments, the troubleshooting unit 502 performs troubleshooting on the target cluster based on N troubleshooting levels in the target troubleshooting path to obtain a troubleshooting result of each troubleshooting level corresponding to the target cluster, including:
acquiring log information corresponding to a target troubleshooting level; the target troubleshooting level is any one of the N troubleshooting levels;
and carrying out fault troubleshooting on the log information corresponding to the target fault troubleshooting level to obtain a target fault troubleshooting result of the target fault troubleshooting level.
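One very simple way to derive a troubleshooting result from log information is to scan the log lines of the target troubleshooting level for an error marker. This is only a sketch under a stated assumption: the rule "any line containing ERROR makes the level abnormal" and the function name `troubleshoot_from_logs` are not taken from the description, which does not specify how the logs are evaluated.

```python
def troubleshoot_from_logs(log_lines, error_marker="ERROR"):
    """Return 'abnormal' if any log line contains the marker, else 'normal'."""
    if any(error_marker in line for line in log_lines):
        return "abnormal"
    return "normal"


# Hypothetical log lines for one troubleshooting level.
log_lines = [
    "INFO component started",
    "ERROR component health probe failed",
]
target_result = troubleshoot_from_logs(log_lines)
```

A production rule would more likely use thresholds or a log analysis model, but the input and output shape stays the same: log information for one level in, one troubleshooting result out.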
In some possible embodiments, after the troubleshooting unit 502 obtains the troubleshooting result of each troubleshooting level corresponding to the target cluster, the troubleshooting unit 502 is further configured to:
generating a fault notification list based on the fault troubleshooting result of each fault troubleshooting level corresponding to the target cluster;
the fault notification list comprises a fault troubleshooting level identification field item and a fault troubleshooting result field item; the identification of the target troubleshooting level is stored at any position of the troubleshooting level identification field item, and the target troubleshooting result is stored at a position corresponding to any position in the troubleshooting result field item.
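The two-column layout of the fault notification list can be sketched as a list of rows, with a header row for the two field items followed by one row per troubleshooting level. The function name `build_notification_list` and the exact header strings are assumptions for illustration.

```python
def build_notification_list(results: dict) -> list:
    """Build a two-column table: level identifier and troubleshooting result."""
    header = ["troubleshooting level", "troubleshooting result"]
    rows = [[level, result] for level, result in results.items()]
    return [header] + rows


results = {"health degree of core component": "normal"}
notification_list = build_notification_list(results)
# notification_list[1] is the second row of the table:
# first column holds the level name, second column holds its result.
second_row = notification_list[1]
```

The result of a level always sits in the same row as its identifier, which is the positional correspondence between the two field items that the description requires.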
In some possible embodiments, after the determining unit 503 determines M abnormal troubleshooting levels according to the troubleshooting result of each troubleshooting level of the target cluster, the obtaining unit 501 is further configured to obtain log information of the M abnormal troubleshooting levels, and display the log information corresponding to the abnormal troubleshooting level through a log information display interface.
In some possible embodiments, the fault processing unit 504 is configured to determine a strategy corresponding to each of the M abnormal troubleshooting levels, including:
acquiring fault scene information of a target cluster;
and determining a strategy corresponding to each abnormal troubleshooting level according to the fault scene information of the target cluster.
In some possible embodiments, the fault processing unit 504 is configured to determine a strategy corresponding to each abnormal troubleshooting level in the M abnormal troubleshooting levels, and execute fault processing based on the strategy corresponding to each abnormal troubleshooting level, including:
when key fault troubleshooting levels exist in the M abnormal fault troubleshooting levels, acquiring a reference cluster corresponding to a target cluster;
and sending the service data in the target cluster to the reference cluster so that the reference cluster executes the service corresponding to the target cluster.
It can be understood that each unit of the fault handling apparatus of this embodiment may be specifically implemented according to the method in the foregoing method embodiment fig. 2 or fig. 4, and a specific implementation process thereof may refer to the description related to the method embodiment fig. 2 or fig. 4, which is not described herein again.
According to another embodiment of the present application, the units in the fault handling apparatus shown in fig. 5 may be separately or entirely combined into one or several other units, or one or more of the units may be further split into multiple functionally smaller units; either arrangement can achieve the same operations without affecting the technical effect of the embodiments of the present application. The units are divided based on logical functions; in practical applications, the function of one unit may be implemented by multiple units, or the functions of multiple units may be implemented by one unit. In other embodiments of the present application, the fault handling apparatus may also include other units, and in practical applications these functions may also be implemented with the assistance of other units or through the cooperation of multiple units.
According to another embodiment of the present application, the fault handling apparatus shown in fig. 5 may be constructed, and the fault handling method of the embodiments of the present application implemented, by running a computer program (including program code) capable of executing the steps of the corresponding method shown in fig. 2 or fig. 4 on a general-purpose computing device, such as a computer, that includes processing elements and storage elements such as a Central Processing Unit (CPU), a random access memory (RAM), and a read-only memory (ROM). The computer program may be recorded on, for example, a computer-readable recording medium, and loaded into and executed in the above-described fault processing device via the computer-readable recording medium.
In this embodiment of the application, the fault processing apparatus may, in response to a trigger operation for performing fault processing on the target cluster, obtain a target troubleshooting path corresponding to the target cluster, where the target troubleshooting path includes N troubleshooting levels. The fault processing apparatus performs troubleshooting on the target cluster based on the N troubleshooting levels in the target troubleshooting path to obtain a troubleshooting result of each troubleshooting level corresponding to the target cluster, determines M abnormal troubleshooting levels according to those results, determines a strategy corresponding to each of the M abnormal troubleshooting levels, and executes fault processing based on the strategy corresponding to each abnormal troubleshooting level. Because the fault processing apparatus troubleshoots the target cluster directly along the target troubleshooting path without manual participation, human resources can be saved and fault processing efficiency improved; fault location is also not limited by the operation and maintenance capability of the operation and maintenance user, so faults can be located quickly and accurately. In addition, because a strategy is preset for each abnormal troubleshooting level, fault processing can be executed directly based on that strategy. The fault processing itself is therefore likewise not limited by the operation and maintenance capability of the operation and maintenance user and can be executed quickly and accurately, which further improves the efficiency and accuracy of fault processing.
Based on the description of the embodiment of the fault handling method, the embodiment of the application also discloses a fault handling device. Referring to fig. 6, the fault handling apparatus includes at least a processor 601, an input interface 602, an output interface 603, and a computer storage medium 604, which may be connected by a bus or other means.
The computer storage medium 604 is a memory device in the fault processing device for storing programs and data. It is understood that the computer storage medium 604 here may include both the built-in storage medium of the fault processing device and an extended storage medium supported by the fault processing device. The computer storage medium 604 provides a storage space that stores the operating system of the fault processing device. Also stored in this storage space are one or more instructions, which may be one or more computer programs (including program code), suitable for loading and execution by the processor 601. Note that the computer storage medium here may be a high-speed RAM; optionally, it may also be at least one computer storage medium located remotely from the processor. The processor, which may be a Central Processing Unit (CPU), is the core and control center of the fault processing device, and is adapted to implement one or more instructions, specifically to load and execute the one or more instructions so as to implement the corresponding method flow or function.
In one embodiment, one or more instructions stored in the computer storage medium 604 may be loaded and executed by the processor 601 to implement the steps involved in performing the corresponding method as shown in fig. 2 or fig. 4, and in particular, the one or more instructions in the computer storage medium 604 may be loaded and executed by the processor 601 to implement the steps of:
responding to a trigger operation for carrying out fault processing on a target cluster, and acquiring a target fault troubleshooting path corresponding to the target cluster; the target troubleshooting path comprises N troubleshooting levels; n is an integer greater than 1;
carrying out troubleshooting on the target cluster based on N troubleshooting levels in the target troubleshooting path to obtain a troubleshooting result of each troubleshooting level corresponding to the target cluster;
determining M abnormal troubleshooting levels according to a troubleshooting result of each troubleshooting level of the target cluster, wherein M is not more than N and is a positive integer;
and determining a strategy corresponding to each abnormal troubleshooting level in the M abnormal troubleshooting levels, and executing fault processing based on the strategy corresponding to each abnormal troubleshooting level.
In some possible embodiments, one troubleshooting level corresponds to one priority;
the processor 601 performs troubleshooting on the target cluster based on N troubleshooting levels in the target troubleshooting path, including:
acquiring the priority of each troubleshooting level in the N troubleshooting levels;
and sequentially carrying out troubleshooting on the target cluster according to the sequence of the priorities from high to low based on the priority of each troubleshooting level in the N troubleshooting levels.
In some possible embodiments, the performing, by the processor 601, a troubleshooting on the target cluster based on N troubleshooting levels in the target troubleshooting path to obtain a troubleshooting result of each troubleshooting level corresponding to the target cluster includes:
acquiring log information corresponding to a target troubleshooting level; the target troubleshooting level is any one of the N troubleshooting levels;
and carrying out fault troubleshooting on the log information corresponding to the target fault troubleshooting level to obtain a target fault troubleshooting result of the target fault troubleshooting level.
In some possible embodiments, after the processor 601 obtains the troubleshooting results of each troubleshooting level corresponding to the target cluster, the processor 601 is further configured to:
generating a fault notification list based on the fault troubleshooting result of each fault troubleshooting level corresponding to the target cluster;
the fault notification list comprises a fault troubleshooting level identification field item and a fault troubleshooting result field item; the identification of the target troubleshooting level is stored at any position of the troubleshooting level identification field item, and the target troubleshooting result is stored at a position corresponding to any position in the troubleshooting result field item.
In some feasible embodiments, after the processor 601 determines M abnormal troubleshooting levels according to the troubleshooting result of each troubleshooting level of the target cluster, the processor 601 is further configured to obtain log information of the M abnormal troubleshooting levels, and display the log information corresponding to the abnormal troubleshooting levels through a log information display interface.
In some possible embodiments, the processor 601 is configured to determine a strategy corresponding to each of the M abnormal troubleshooting levels, including:
acquiring fault scene information of a target cluster;
and determining a strategy corresponding to each abnormal troubleshooting level according to the fault scene information of the target cluster.
In some possible embodiments, the processor 601 is configured to determine a strategy corresponding to each abnormal troubleshooting level in the M abnormal troubleshooting levels, and execute fault processing based on the strategy corresponding to each abnormal troubleshooting level, including:
when key fault troubleshooting levels exist in the M abnormal fault troubleshooting levels, acquiring a reference cluster corresponding to a target cluster;
and sending the service data in the target cluster to the reference cluster so that the reference cluster executes the service corresponding to the target cluster.
It can be understood that each unit of the fault handling apparatus of this embodiment may be specifically implemented according to the method in the foregoing method embodiment fig. 2 or fig. 4, and a specific implementation process thereof may refer to the description related to the method embodiment fig. 2 or fig. 4, which is not described herein again.
In this embodiment of the present application, the fault processing device may, in response to a trigger operation for performing fault processing on a target cluster, obtain a target troubleshooting path corresponding to the target cluster, where the target troubleshooting path includes N troubleshooting levels. The fault processing device performs troubleshooting on the target cluster based on the N troubleshooting levels in the target troubleshooting path to obtain a troubleshooting result of each troubleshooting level corresponding to the target cluster, determines M abnormal troubleshooting levels according to those results, determines a strategy corresponding to each of the M abnormal troubleshooting levels, and executes fault processing based on the strategy corresponding to each abnormal troubleshooting level. Because the fault processing device troubleshoots the target cluster directly along the target troubleshooting path without manual participation, human resources can be saved and fault processing efficiency improved; fault location is also not limited by the operation and maintenance capability of the operation and maintenance user, so faults can be located quickly and accurately. In addition, because a strategy is preset for each abnormal troubleshooting level, fault processing can be executed directly based on that strategy. The fault processing itself is therefore likewise not limited by the operation and maintenance capability of the operation and maintenance user and can be executed quickly and accurately, which further improves the efficiency and accuracy of fault processing.
It should be noted that the present application also provides a computer program product or a computer program, where the computer program product or the computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium. The processor of the fault handling device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, causing the fault handling device to perform the steps performed in fig. 2 or fig. 4 of the above-described fault handling method embodiments.
While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.