Method and system for locating a failed hard disk
1. A method for locating a failed hard disk, characterized by comprising the following steps:
periodically collecting array card logs to be processed corresponding to a Redundant Array of Independent Disks (RAID) to be detected, wherein the RAID to be detected is any RAID in any node server of a server cluster;
analyzing the array card logs to be processed to obtain a first change value of a first preset index of each hard disk before and after a Patrol Read (PR) is executed on the RAID to be detected, a first execution duration required for executing the PR on the RAID to be detected, a second change value of the first preset index of each hard disk before and after a Consistency Check (CC) is executed on the RAID to be detected, and a second execution duration required for executing the CC on the RAID to be detected;
determining whether a fault hard disk exists in the RAID to be detected according to the first execution time length, the second execution time length, the first change value and the second change value corresponding to each hard disk in the RAID to be detected;
and if so, acquiring hard disk information corresponding to the fault hard disk.
2. The method according to claim 1, wherein determining whether there is a failed hard disk in the RAID to be detected according to the first execution duration, the second execution duration, and the first change value and the second change value corresponding to each hard disk in the RAID to be detected comprises:
for each hard disk in the RAID to be detected, judging whether the hard disk meets a preset fault condition or not according to the first execution duration and the second execution duration and by combining the first change value and the second change value corresponding to the hard disk, and if so, determining that the hard disk is a fault hard disk;
wherein the preset fault condition is: the first variation value is greater than or equal to a first threshold, the second variation value is greater than or equal to a second threshold, the first execution time length is greater than or equal to a third threshold, and the second execution time length is greater than or equal to a fourth threshold.
3. The method according to claim 1, characterized in that the first preset index comprises at least: a media error counter, a predictive failure counter, an other error counter, and a hardware error counter.
4. The method according to claim 1, wherein before the periodically collecting the array card logs to be processed corresponding to the redundant array of independent disks RAID to be detected, the method further comprises:
and executing PR and CC on the RAID to be detected according to preset execution time and execution period, wherein the execution time and the execution period are determined based on a second preset index and preset information corresponding to the node server to which the RAID to be detected belongs.
5. The method according to claim 4, characterized in that the second preset index comprises at least: the CPU utilization, the memory utilization, the CPU IO wait, the total network card traffic per second, the swap memory utilization, the disk busyness, and the disk IO throughput.
6. The method according to claim 1, wherein after obtaining the hard disk information corresponding to the failed hard disk, the method further comprises:
acquiring application system information and busy/idle time period information of an application system associated with a node server to which the fault hard disk belongs, and acquiring performance information of an array card corresponding to the RAID to be detected;
formulating a disk replacement strategy according to the application system information, the busy/idle time period information, and the array card performance information, in combination with a disk replacement rule;
and sending the disk replacement strategy and an alarm notification to a specified object, wherein the alarm notification at least comprises the hard disk information corresponding to the fault hard disk.
7. A system for locating a failed hard disk, said system comprising:
a collecting unit, configured to periodically collect array card logs to be processed corresponding to a Redundant Array of Independent Disks (RAID) to be detected, wherein the RAID to be detected is any RAID in any node server of a server cluster;
an analysis unit, configured to analyze the array card logs to be processed to obtain a first change value of a first preset index of each hard disk before and after a Patrol Read (PR) is executed on the RAID to be detected, a first execution duration required for executing the PR on the RAID to be detected, a second change value of the first preset index of each hard disk before and after a Consistency Check (CC) is executed on the RAID to be detected, and a second execution duration required for executing the CC on the RAID to be detected;
the processing unit is used for determining whether a fault hard disk exists in the RAID to be detected according to the first execution time length, the second execution time length, the first change value and the second change value corresponding to each hard disk in the RAID to be detected; and if so, acquiring hard disk information corresponding to the fault hard disk.
8. The system according to claim 7, wherein the processing unit configured to determine whether there is a failed hard disk in the RAID to be detected is specifically configured to:
for each hard disk in the RAID to be detected, judging whether the hard disk meets a preset fault condition or not according to the first execution duration and the second execution duration and by combining the first change value and the second change value corresponding to the hard disk, and if so, determining that the hard disk is a fault hard disk;
wherein the preset fault condition is: the first variation value is greater than or equal to a first threshold, the second variation value is greater than or equal to a second threshold, the first execution time length is greater than or equal to a third threshold, and the second execution time length is greater than or equal to a fourth threshold.
9. The system according to claim 7, characterized in that the first preset index comprises at least: a media error counter, a predictive failure counter, an other error counter, and a hardware error counter.
10. The system of claim 7, further comprising:
and the execution unit is used for executing PR and CC on the RAID to be detected according to preset execution time and execution period, wherein the execution time and the execution period are determined based on a second preset index and preset information corresponding to the node server to which the RAID to be detected belongs.
Background
With the development of computer technology, the demands placed on servers for computing and storing massive amounts of data keep growing. The hard disk is a core component for server storage and computation, and its stable operation is an important factor in ensuring that the server provides a stable service. How to determine in time that a hard disk has failed, and how to locate the failed hard disk in time, is therefore a problem that urgently needs to be solved.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and a system for locating a failed hard disk, so that a failed hard disk is discovered and located in time.
In order to achieve the above purpose, the embodiments of the present invention provide the following technical solutions:
the first aspect of the embodiment of the invention discloses a method for positioning a fault hard disk, which comprises the following steps:
periodically collecting array card logs to be processed corresponding to a Redundant Array of Independent Disks (RAID) to be detected, wherein the RAID to be detected is any RAID in any node server of a server cluster;
analyzing the array card logs to be processed to obtain a first change value of a first preset index of each hard disk before and after a Patrol Read (PR) is executed on the RAID to be detected, a first execution duration required for executing the PR on the RAID to be detected, a second change value of the first preset index of each hard disk before and after a Consistency Check (CC) is executed on the RAID to be detected, and a second execution duration required for executing the CC on the RAID to be detected;
determining whether a fault hard disk exists in the RAID to be detected according to the first execution time length, the second execution time length, the first change value and the second change value corresponding to each hard disk in the RAID to be detected;
and if so, acquiring hard disk information corresponding to the fault hard disk.
Preferably, determining whether a failed hard disk exists in the RAID to be detected according to the first execution duration, the second execution duration, and the first change value and the second change value corresponding to each hard disk in the RAID to be detected includes:
for each hard disk in the RAID to be detected, judging whether the hard disk meets a preset fault condition or not according to the first execution duration and the second execution duration and by combining the first change value and the second change value corresponding to the hard disk, and if so, determining that the hard disk is a fault hard disk;
wherein the preset fault condition is: the first variation value is greater than or equal to a first threshold, the second variation value is greater than or equal to a second threshold, the first execution time length is greater than or equal to a third threshold, and the second execution time length is greater than or equal to a fourth threshold.
Preferably, the first preset index at least includes: a media error counter, a predictive failure counter, an other error counter, and a hardware error counter.
Preferably, before the periodically collecting the array card logs to be processed corresponding to the redundant array of independent disks RAID to be detected, the method further includes:
and executing PR and CC on the RAID to be detected according to preset execution time and execution period, wherein the execution time and the execution period are determined based on a second preset index and preset information corresponding to the node server to which the RAID to be detected belongs.
Preferably, the second preset index at least includes: the CPU utilization, the memory utilization, the CPU IO wait, the total network card traffic per second, the swap memory utilization, the disk busyness, and the disk IO throughput.
Preferably, after obtaining the hard disk information corresponding to the failed hard disk, the method further includes:
acquiring application system information and busy/idle time period information of an application system associated with a node server to which the fault hard disk belongs, and acquiring performance information of an array card corresponding to the RAID to be detected;
formulating a disk replacement strategy according to the application system information, the busy/idle time period information, and the array card performance information, in combination with a disk replacement rule;
and sending the disk replacement strategy and an alarm notification to a specified object, wherein the alarm notification at least comprises the hard disk information corresponding to the fault hard disk.
The second aspect of the embodiments of the present invention discloses a system for locating a failed hard disk, where the system includes:
a collecting unit, configured to periodically collect array card logs to be processed corresponding to a Redundant Array of Independent Disks (RAID) to be detected, wherein the RAID to be detected is any RAID in any node server of a server cluster;
an analysis unit, configured to analyze the array card logs to be processed to obtain a first change value of a first preset index of each hard disk before and after a Patrol Read (PR) is executed on the RAID to be detected, a first execution duration required for executing the PR on the RAID to be detected, a second change value of the first preset index of each hard disk before and after a Consistency Check (CC) is executed on the RAID to be detected, and a second execution duration required for executing the CC on the RAID to be detected;
the processing unit is used for determining whether a fault hard disk exists in the RAID to be detected according to the first execution time length, the second execution time length, the first change value and the second change value corresponding to each hard disk in the RAID to be detected; and if so, acquiring hard disk information corresponding to the fault hard disk.
Preferably, the processing unit configured to determine whether there is a failed hard disk in the RAID to be detected is specifically configured to:
for each hard disk in the RAID to be detected, judging whether the hard disk meets a preset fault condition or not according to the first execution duration and the second execution duration and by combining the first change value and the second change value corresponding to the hard disk, and if so, determining that the hard disk is a fault hard disk;
wherein the preset fault condition is: the first variation value is greater than or equal to a first threshold, the second variation value is greater than or equal to a second threshold, the first execution time length is greater than or equal to a third threshold, and the second execution time length is greater than or equal to a fourth threshold.
Preferably, the first preset index at least includes: a media error counter, a predictive failure counter, an other error counter, and a hardware error counter.
Preferably, the system further comprises:
and the execution unit is used for executing PR and CC on the RAID to be detected according to preset execution time and execution period, wherein the execution time and the execution period are determined based on a second preset index and preset information corresponding to the node server to which the RAID to be detected belongs.
In the method and system for locating a failed hard disk provided by the embodiments of the present invention, the method is: periodically collecting array card logs to be processed corresponding to the RAID to be detected; analyzing the array card logs to be processed to obtain a first change value of a first preset index of each hard disk before and after PR is executed on the RAID to be detected, a first execution duration required for executing PR on the RAID to be detected, a second change value of the first preset index of each hard disk before and after CC is executed on the RAID to be detected, and a second execution duration required for executing CC on the RAID to be detected; determining whether a failed hard disk exists in the RAID to be detected according to the first execution duration, the second execution duration, and the first change value and the second change value corresponding to each hard disk in the RAID to be detected; and if so, acquiring hard disk information corresponding to the failed hard disk. The failed hard disk is thus determined from the change values of the preset index of each hard disk before and after PR and CC are executed, together with the execution durations of PR and CC, so that the failed hard disk is located accurately and in time.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a flowchart of a method for locating a failed hard disk according to an embodiment of the present invention;
Fig. 2 is another flowchart of a method for locating a failed hard disk according to an embodiment of the present invention;
Fig. 3 is a structural block diagram of a system for locating a failed hard disk according to an embodiment of the present invention;
Fig. 4 is another structural block diagram of a system for locating a failed hard disk according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In this application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
As can be seen from the background art, in order to ensure that a server can provide a stable service, a failed hard disk needs to be determined and located in time, and therefore how to locate the failed hard disk is a problem that needs to be solved urgently at present.
Therefore, embodiments of the present invention provide a method and a system for locating a failed hard disk, in which the change values of a preset index of each hard disk before and after PR and CC are executed on a RAID, and the execution durations required for executing PR and CC on the RAID, are obtained by analyzing the array card log. Whether a hard disk has failed is then determined by combining the change values of the preset index corresponding to the hard disk with the execution durations, so that the failed hard disk is located accurately and in time.
It should be noted that several English abbreviations are used in the embodiments of the present invention; they are explained in advance as follows.
RAID: Redundant Array of Independent Disks.
RAID card: an array card, i.e., a card used to implement RAID.
PR: Patrol Read.
CC: Consistency Check.
CMDB: Configuration Management Database.
Referring to fig. 1, a flowchart of a method for positioning a failed hard disk according to an embodiment of the present invention is shown, where the method includes:
Step S101: periodically collecting array card logs to be processed corresponding to the RAID to be detected.
It should be noted that the server cluster is composed of a plurality of node servers, each node server includes a plurality of RAIDs, each RAID is composed of a plurality of hard disks, the RAIDs included in the node servers can be distinguished (specifically distinguished by an array number) by an array card log, the array card log further includes at least a hard disk slot number, and the hard disk slot number can be used for distinguishing different hard disks in the RAID.
It is understood that the RAID to be detected is any RAID in any node server of the server cluster.
In the process of implementing step S101, to-be-processed array card logs corresponding to the RAID to be detected are periodically collected.
Preferably, before the array card logs to be processed corresponding to the RAID to be detected are collected, the RAID card is used to execute PR and CC on the RAID to be detected. Executing PR and CC occupies part of the computing resources, so, in order not to affect normal operation of the node server, the execution time and the execution period of PR and CC (equivalent to a scheduled task) are set according to how busy the node server is. Specifically, the execution time and the execution period are determined based on a second preset index and preset information corresponding to the node server to which the RAID to be detected belongs. On this basis, the array card logs to be processed are periodically collected according to the execution time and the execution period of PR and CC; the array card logs to be processed are the array card logs from before and after PR and CC are executed.
It should be noted that the execution time and the execution cycle for executing PR and CC may also be adjusted according to actual requirements or failure rates of the RAID to be detected, and the manner of formulating the execution time and the execution cycle is not limited to the above-mentioned manner.
In some embodiments, the second preset index at least includes: central processing unit (CPU) utilization, memory utilization, CPU IO wait (IO meaning input/output), total network card traffic per second, swap memory (swap) utilization, disk busyness, and disk IO throughput. The preset information corresponding to the node server to which the RAID to be detected belongs at least includes: application system information corresponding to the application system associated with the node server (such as the application system name, importance level information, service level agreement, and the application administrator together with the administrator's contact information).
It should be noted that the second preset index mentioned above can be obtained by acquiring information of each index item corresponding to the node server to which the RAID to be detected belongs, and the content of the index item for acquiring the second preset index is shown in table 1.
Table 1:
Index item | Explanation of the index item
CpuUtil | CPU utilization
SedMemPerccent | Memory utilization
IOwait | CPU IO wait
NET_RATE | Total network card traffic per second
SwapUsedPercent | Swap utilization
DISKPercentBusy | Disk busyness
DISKIORate | Disk IO throughput
It should be noted that the index items shown in table 1 for acquiring the second preset index are only used for illustration, and in practical applications, the index items for acquiring the second preset index may be determined according to actual requirements, for example, each index item for acquiring the second preset index may be increased or decreased on the basis of table 1, or other index items for acquiring the second preset index may be selected according to actual requirements, which is not specifically limited herein.
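As a minimal illustrative sketch, and not a limitation of the embodiment, an execution time for PR and CC could be picked from samples of the Table 1 index items as follows. The scoring rule (an unweighted mean of three percentage metrics), the function names, and the sample values are all assumptions made for illustration; the embodiment does not specify how the second preset index is combined.

```python
from statistics import mean

def busyness(sample):
    """Assumed score: unweighted mean of three percentage metrics from Table 1."""
    return mean([sample["CpuUtil"], sample["IOwait"], sample["DISKPercentBusy"]])

def pick_execution_hour(samples_by_hour):
    """Return the hour of day whose busyness score is lowest."""
    return min(samples_by_hour, key=lambda hour: busyness(samples_by_hour[hour]))

# Hypothetical samples for one node server, keyed by hour of day (values in percent).
samples = {
    2: {"CpuUtil": 10.0, "IOwait": 1.0, "DISKPercentBusy": 4.0},
    14: {"CpuUtil": 80.0, "IOwait": 12.0, "DISKPercentBusy": 70.0},
    22: {"CpuUtil": 30.0, "IOwait": 3.0, "DISKPercentBusy": 15.0},
}
execution_hour = pick_execution_hour(samples)
```

In practice the preset information of the node server (for example, the importance level of the associated application system) could further constrain which hours are eligible.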
Step S102: analyzing the array card log to be processed to obtain a first change value of a first preset index of each hard disk before and after PR execution of the RAID to be detected, obtain a first execution time length required by PR execution of the RAID to be detected, obtain a second change value of the first preset index of each hard disk before and after CC execution of the RAID to be detected, and obtain a second execution time length required by CC execution of the RAID to be detected.
It should be noted that, after the RAID card is used to execute PR on the RAID to be detected, the first preset index of each hard disk in the RAID to be detected changes accordingly; similarly, after the RAID card is used to execute CC on the RAID to be detected, the first preset index of each hard disk also changes accordingly. The value of the first preset index of each hard disk before and after executing PR and before and after executing CC is recorded in the array card log corresponding to the RAID to be detected. Therefore, by collecting the array card logs from before and after PR and CC are executed, the value of the first preset index of each hard disk both before and after each operation can be obtained.
It should be further noted that PR and CC are used to repair hard disk errors or inconsistent data. If executing PR or CC takes too long, this indicates that there are many hard disk errors and that the read/write capability of the hard disk is poor, so the time required for executing PR and CC can be used as one of the bases for determining whether a hard disk has failed. This time can be obtained from the array card log: specifically, the start time and end time of executing PR and the start time and end time of executing CC are obtained from the array card log, the time required for executing PR is determined from the PR start time and end time, and the time required for executing CC is determined from the CC start time and end time.
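The duration calculation described above can be sketched as follows. This is an illustrative example only; the timestamp format and the function name are assumptions, since the embodiment does not state how times are written in the array card log.

```python
from datetime import datetime

def execution_duration_seconds(start, end, fmt="%Y-%m-%d %H:%M:%S"):
    """Return the elapsed seconds between a start time and an end time string."""
    started = datetime.strptime(start, fmt)
    ended = datetime.strptime(end, fmt)
    if ended < started:
        # An end time before the start time means the two entries do not belong together.
        raise ValueError("end time precedes start time")
    return (ended - started).total_seconds()

# Hypothetical PR run lasting two and a half hours.
first_execution_duration = execution_duration_seconds(
    "2021-01-01 01:00:00", "2021-01-01 03:30:00"
)
```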
In the process of implementing step S102, the array card logs to be processed are analyzed to obtain the value of the first preset index of each hard disk in the RAID to be detected before PR is executed and the value of the first preset index of each hard disk after PR is executed. From these two values, the first change value of the first preset index of each hard disk before and after PR is executed on the RAID to be detected is determined. Analyzing the array card logs to be processed also yields the first execution duration required for executing PR on the RAID to be detected (determined from the start time and end time of executing PR).
Similarly, the array card logs to be processed are analyzed to obtain the value of the first preset index of each hard disk before CC is executed on the RAID to be detected and the value of the first preset index of each hard disk after CC is executed. From these two values, the second change value of the first preset index of each hard disk before and after CC is executed on the RAID to be detected is determined. Analyzing the array card logs to be processed also yields the second execution duration required for executing CC on the RAID to be detected (determined from the start time and end time of executing CC).
In some embodiments, the first preset index at least includes: a media error counter, a predictive failure counter, an other error counter, and a hardware error counter.
In some embodiments, when the value of the first preset index of each hard disk before and after PR execution and before and after CC execution is obtained from the array card log to be processed, the value of the first preset index can be obtained by acquiring information of each index item corresponding to the node server to which the RAID to be detected belongs, and the content of the index item for acquiring the value of the first preset index is described in table 2.
Table 2:
Index item | Explanation of the index item
Slot Number | Hard disk slot number
Media Error Count | Media error counter
Predictive Failure Count | Predictive failure counter
Other Error Count | Other error counter
Hardware Error Count | Hardware error counter
It is understood that the index entry of the hard disk slot number in table 2 is used to indicate the corresponding hard disk.
It should be noted that the index items shown in table 2 for acquiring the value of the first preset index are only used for illustration, and in practical applications, the index items for acquiring the value of the first preset index may be determined according to actual requirements, for example, each index item for acquiring the value of the first preset index may be increased or decreased on the basis of table 2, or other index items for acquiring the value of the first preset index may be selected according to actual requirements, which is not specifically limited herein.
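For illustration only, the change value of each index item could be computed from two counter snapshots as in the sketch below. It assumes the snapshots have already been parsed into dictionaries keyed by slot number, with counter names written as in Table 2 with normalized spacing; the function name and the sample values are hypothetical.

```python
COUNTERS = (
    "Media Error Count",
    "Predictive Failure Count",
    "Other Error Count",
    "Hardware Error Count",
)

def change_values(before, after):
    """Return {slot: {counter: after - before}} for slots present in both snapshots."""
    changes = {}
    for slot, old in before.items():
        new = after.get(slot)
        if new is None:
            continue  # disk absent from the later snapshot; skipped in this sketch
        changes[slot] = {name: new[name] - old[name] for name in COUNTERS}
    return changes

# Hypothetical snapshots taken before and after PR for two slots.
before_pr = {
    0: {"Media Error Count": 0, "Predictive Failure Count": 0,
        "Other Error Count": 1, "Hardware Error Count": 0},
    1: {"Media Error Count": 5, "Predictive Failure Count": 0,
        "Other Error Count": 2, "Hardware Error Count": 0},
}
after_pr = {
    0: {"Media Error Count": 0, "Predictive Failure Count": 0,
        "Other Error Count": 1, "Hardware Error Count": 0},
    1: {"Media Error Count": 12, "Predictive Failure Count": 1,
        "Other Error Count": 2, "Hardware Error Count": 0},
}
first_change = change_values(before_pr, after_pr)
```

The same function applies unchanged to the snapshots taken before and after CC to obtain the second change value.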
With the above, the first execution time length is determined according to the start time of executing the PR on the RAID to be detected and the end time of executing the PR, and the second execution time length is determined according to the start time of executing the CC on the RAID to be detected and the end time of executing the CC. In a specific implementation, the start time of executing PR, the end time of executing PR, the start time of executing CC, and the end time of executing CC may be obtained by matching a keyword from the array card log to be processed, where the specific content of the keyword is as shown in table 3.
Table 3:
Keyword | Definition of the keyword
Patrol Read started | PR start time
Consistency Check started | CC start time
Patrol Read completed | PR end time
Consistency Check done | CC end time
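The keyword matching can be sketched as follows. This is a hedged example: the line layout (a timestamp followed by a message) and the sample log are assumptions, and a real array card log would need its own timestamp pattern. The patterns allow optional whitespace before the final word of each keyword.

```python
import re
from datetime import datetime

# Assumed line layout: "<YYYY-MM-DD HH:MM:SS> <message>".
LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(.*)$")

# Keywords from Table 3; \s* tolerates a missing space before the last word.
KEYWORDS = {
    "pr_start": re.compile(r"Patrol Read\s*started"),
    "pr_end": re.compile(r"Patrol Read\s*completed"),
    "cc_start": re.compile(r"Consistency Check\s*started"),
    "cc_end": re.compile(r"Consistency Check\s*done"),
}

def extract_times(log_text):
    """Return the first timestamp found for each Table 3 keyword."""
    times = {}
    for raw in log_text.splitlines():
        match = LINE.match(raw.strip())
        if not match:
            continue  # lines without a leading timestamp are ignored
        stamp, message = match.groups()
        for key, pattern in KEYWORDS.items():
            if key not in times and pattern.search(message):
                times[key] = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    return times

# Hypothetical log excerpt.
sample_log = """\
2021-03-01 01:00:00 Patrol Read started
2021-03-01 02:30:00 Patrol Read completed
2021-03-01 03:00:00 Consistency Check started on VD 0
2021-03-01 05:00:00 Consistency Check done on VD 0
"""
times = extract_times(sample_log)
```

The first execution duration is then the difference between `times["pr_end"]` and `times["pr_start"]`, and the second execution duration is the difference between `times["cc_end"]` and `times["cc_start"]`.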
Step S103: determining whether a failed hard disk exists in the RAID to be detected according to the first execution duration, the second execution duration, and the first change value and the second change value corresponding to each hard disk in the RAID to be detected. If yes, step S104 is executed; if not, the process returns to step S101.
In the process of implementing step S103 specifically, for each hard disk in the RAID to be detected, according to the first execution duration and the second execution duration, and by combining the first change value and the second change value of the first preset index corresponding to the hard disk, it is determined whether the hard disk meets the preset fault condition, and if yes, it is determined that the hard disk is a faulty hard disk. The preset fault conditions are as follows: the first change value is greater than or equal to a first threshold value, the second change value is greater than or equal to a second threshold value, the first execution time length is greater than or equal to a third threshold value, and the second execution time length is greater than or equal to a fourth threshold value. And if the RAID to be detected does not have the fault hard disk, returning to the step S101, continuously acquiring a new array card log to be processed, and continuously monitoring the hard disk of the RAID to be detected.
As can be seen from the above, there are multiple first preset indexes of the hard disk, and in a specific implementation, for each hard disk in the RAID to be detected, when the first execution duration, the second execution duration, the first variation value and the second variation index value of any one or more first preset indexes of the hard disk satisfy the following fault condition, it is determined that the hard disk is a faulty hard disk. The fault conditions are: the first change value is greater than or equal to a first threshold, the second change value is greater than or equal to a second threshold, the first execution time length is greater than or equal to a third threshold, the second execution time length is greater than or equal to a fourth threshold, and the thresholds can be adjusted according to actual conditions.
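A minimal sketch of this judgment is given below. It reads the condition as requiring both durations to reach their thresholds and at least one index item whose first and second change values both reach theirs; this reading, the names, and the threshold values are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Thresholds:
    first: int     # minimum change value across PR
    second: int    # minimum change value across CC
    third: float   # minimum PR execution duration, in seconds
    fourth: float  # minimum CC execution duration, in seconds

def is_failed(pr_changes, cc_changes, pr_duration, cc_duration, t):
    """Return True when a disk meets the preset fault condition."""
    if pr_duration < t.third or cc_duration < t.fourth:
        return False
    # Any single index item meeting both change thresholds is sufficient.
    return any(
        pr_changes[name] >= t.first and cc_changes.get(name, 0) >= t.second
        for name in pr_changes
    )

# Hypothetical thresholds and measurements for one hard disk.
limits = Thresholds(first=5, second=5, third=3600.0, fourth=3600.0)
failed = is_failed(
    {"Media Error Count": 7, "Other Error Count": 0},
    {"Media Error Count": 6, "Other Error Count": 0},
    pr_duration=5400.0,
    cc_duration=7200.0,
    t=limits,
)
```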
Step S104: acquiring hard disk information corresponding to the failed hard disk.
In a specific implementation of step S104, if a failed hard disk is determined in the RAID to be detected, the hard disk information of the failed hard disk is acquired, where the hard disk information indicates which hard disk of which RAID of which node server the failed hard disk is.
Preferably, after it is determined that a failed hard disk exists in the RAID to be detected and the hard disk information corresponding to the failed hard disk is acquired, application system information and busy/idle time period information of the application system associated with the node server to which the failed hard disk belongs are acquired from the CMDB, and array card performance information corresponding to the RAID to be detected is acquired (that is, performance information of the array card that executes PR and CC on the RAID to be detected). Examples of the application system information are given in the description of step S201. The array card performance information is used for determining the busy and idle time periods of the hard disk. A disk replacement strategy is formulated according to the application system information, the busy/idle time period information and the array card performance information, in combination with a disk replacement rule. The disk replacement strategy and an alarm notification are sent to a specified object (such as operation and maintenance personnel), so that the specified object learns the hard disk information of the failed hard disk from the alarm notification and handles the failed hard disk according to the disk replacement strategy. The alarm notification at least includes the hard disk information corresponding to the failed hard disk.
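The embodiment does not specify the disk replacement rule. As a minimal sketch under stated assumptions, the following Python code treats the rule as "use the first idle period that is long enough" and builds an alarm text that carries the hard disk information. DiskInfo, pick_replacement_window, build_alarm, and the representation of idle periods as (start hour, end hour) pairs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiskInfo:
    """Hypothetical record: which disk of which RAID of which node server."""
    node_server: str
    raid_id: str
    slot: int


def pick_replacement_window(idle_periods, min_hours):
    """Return the first (start_hour, end_hour) idle period lasting at
    least min_hours, or None. This stands in for the unspecified rule."""
    for start, end in idle_periods:
        if end - start >= min_hours:
            return (start, end)
    return None


def build_alarm(disk, window):
    """Compose an alarm text containing the hard disk information."""
    if window is None:
        when = "no idle window found"
    else:
        when = f"replace between {window[0]:02d}:00 and {window[1]:02d}:00"
    return (f"Failed disk: node={disk.node_server}, "
            f"RAID={disk.raid_id}, slot={disk.slot}; {when}")
```

A real strategy would presumably also weigh the importance level and service level agreement of the application system; those inputs are omitted here to keep the sketch short.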
Failed hard disks are located for each RAID of each node server in the server cluster according to the contents of steps S101 to S104.
In the embodiment of the invention, the change values of the preset indexes of each hard disk before and after PR and CC are executed on the RAID, and the execution durations required to execute PR and CC on the RAID, are obtained by analyzing the array card logs. Whether a hard disk has failed is determined by combining the change values of the preset indexes corresponding to the hard disk with the execution durations, so that the failed hard disk is located accurately and in time and the stable operation of the server cluster is ensured.
To better explain the above embodiment of the invention, the content shown in fig. 1 is illustrated by way of example with reference to fig. 2.
Referring to fig. 2, another flowchart of a method for locating a failed hard disk according to an embodiment of the present invention is shown, including the following steps:
Step S201: acquiring preset information corresponding to the node server to which the RAID to be detected belongs by using the CMDB.
It should be noted that the preset information corresponding to the node server to which the RAID to be detected belongs at least includes the application system information corresponding to the application system associated with the node server, for example: the application system name, importance level information, service level agreement, the application manager and the contact information of the application manager, and the like.
Step S202: acquiring a second preset index corresponding to the node server to which the RAID to be detected belongs.
The specific contents of the second preset index are shown in table 1.
Step S203: determining an execution time and an execution period for executing PR and CC based on the second preset index and the preset information.
Step S204: periodically collecting the array card logs to be processed corresponding to the RAID to be detected.
Step S205: analyzing the array card log to be processed to obtain the first execution duration, the second execution duration, and the first change value and the second change value of the first preset index of each hard disk in the RAID to be detected, and determining the failed hard disk in the RAID to be detected according to the first change value and the second change value of the first preset index.
Step S206: determining whether a failed hard disk exists in the RAID to be detected; if so, executing step S207, and if not, returning to step S204.
Step S207: formulating a disk replacement strategy.
Step S208: processing the failed hard disk according to the disk replacement strategy.
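The control flow of steps S204 to S208 can be sketched as a single monitoring pass, shown below in Python. This is an illustration only: monitor_once and its five callable parameters are hypothetical names, and each callable stands in for processing that the embodiment describes elsewhere.

```python
def monitor_once(collect_log, analyze, find_faulty, make_strategy, handle):
    """Run one pass of S204 to S208.

    Returns True if a failed disk was found and handled, False if the
    caller should simply wait and return to S204.
    """
    log = collect_log()                 # S204: collect array card log
    metrics = analyze(log)              # S205: durations and change values
    faulty = find_faulty(metrics)       # S206: any failed hard disk?
    if not faulty:
        return False                    # no fault: go back to S204
    strategy = make_strategy(faulty)    # S207: disk replacement strategy
    handle(faulty, strategy)            # S208: process the failed disk
    return True
```

Passing the steps in as callables keeps the loop independent of any particular array card log format, which the embodiment leaves open.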
Corresponding to the above method for positioning a failed hard disk provided in the embodiment of the present invention, referring to fig. 3, an embodiment of the present invention further provides a structural block diagram of a positioning system for a failed hard disk, where the positioning system includes: an acquisition unit 301, an analysis unit 302 and a processing unit 303.
The acquisition unit 301 is configured to periodically collect the array card logs to be processed corresponding to the RAID to be detected, where the RAID to be detected is any RAID in any node server of the server cluster.
The analysis unit 302 is configured to analyze the array card log to be processed, to obtain a first change value of a first preset index of each hard disk before and after PR is executed on the RAID to be detected, to obtain a first execution duration required to execute PR on the RAID to be detected, to obtain a second change value of the first preset index of each hard disk before and after CC is executed on the RAID to be detected, and to obtain a second execution duration required to execute CC on the RAID to be detected.
In a specific implementation, the first preset index at least includes: media error counters, expected error counters, other error counters, and hardware error counters.
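As a minimal sketch of what the analysis yields, the following Python code computes the change values of the first preset indexes from counter snapshots taken before and after PR or CC, and an execution duration from start and end timestamps. The array card log format is not given in the embodiment, so the sketch assumes the counters and timestamps have already been extracted; the key names in INDEXES and both function names are hypothetical.

```python
from datetime import datetime

# Hypothetical keys for the four first preset indexes named above.
INDEXES = ("media_error", "expected_error", "other_error", "hardware_error")


def change_values(before, after):
    """Per-index change for one hard disk between two counter snapshots.
    A counter missing from a snapshot is treated as 0."""
    return {name: after.get(name, 0) - before.get(name, 0) for name in INDEXES}


def execution_duration(start, end):
    """Duration of a PR or CC run in seconds, from its logged
    start and end times (datetime objects)."""
    return (end - start).total_seconds()
```

For example, a media error counter that reads 2 before PR and 9 after PR gives a first change value of 7 for that index.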
The processing unit 303 is configured to determine whether a failed hard disk exists in the RAID to be detected according to the first execution duration, the second execution duration, and the first change value and the second change value corresponding to each hard disk in the RAID to be detected; and if so, acquiring hard disk information corresponding to the failed hard disk.
In a specific implementation, the processing unit 303, configured to determine whether a failed hard disk exists in the RAID to be detected, is specifically configured to: for each hard disk in the RAID to be detected, judging whether the hard disk meets a preset fault condition or not according to the first execution time length and the second execution time length and by combining a first change value and a second change value corresponding to the hard disk, and if so, determining the hard disk as a fault hard disk; the preset fault conditions are as follows: the first change value is greater than or equal to a first threshold value, the second change value is greater than or equal to a second threshold value, the first execution time length is greater than or equal to a third threshold value, and the second execution time length is greater than or equal to a fourth threshold value.
Preferably, the processing unit 303 is further configured to: acquiring application system information and busy/idle time period information of an application system associated with a node server to which a fault hard disk belongs, and acquiring array card performance information corresponding to a RAID to be detected; according to the application system information, the busy and idle time period information and the array card performance information, combining with a disc changing rule, and making a disc changing strategy; and sending the disk replacement strategy and an alarm notification to a specified object, wherein the alarm notification at least comprises the hard disk information corresponding to the fault hard disk.
In the embodiment of the invention, the change values of the preset indexes of each hard disk before and after PR and CC are executed on the RAID, and the execution durations required to execute PR and CC on the RAID, are obtained by analyzing the array card logs. Whether a hard disk has failed is determined by combining the change values of the preset indexes corresponding to the hard disk with the execution durations, so that the failed hard disk is located accurately and in time and the stable operation of the server cluster is ensured.
Preferably, referring to fig. 4 in conjunction with fig. 3, a structural block diagram of a positioning system for a failed hard disk according to an embodiment of the present invention is shown, where the positioning system further includes:
The execution unit 304 is configured to execute PR and CC on the RAID to be detected according to a preset execution time and execution period, where the execution time and execution period are determined based on a second preset index and preset information corresponding to the node server to which the RAID to be detected belongs.
In a specific implementation, the second preset index at least includes: CPU utilization rate, memory utilization rate, CPU wait IO, total network card flow per second, swap utilization rate, disk busyness and disk IO throughput.
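The embodiment does not state how the execution time is derived from the second preset index. One possible approach, shown here purely as an assumption, is to pick the hour of day at which the node server's average load is lowest. The function name pick_execution_hour, the metric names, and the normalization of every metric to the range 0 to 1 are hypothetical.

```python
def pick_execution_hour(hourly_load):
    """Return the hour of day with the lowest mean load.

    hourly_load maps an hour (0-23) to a non-empty dict of
    metric name -> value normalized to [0, 1], for example CPU
    utilization or disk busyness sampled on the node server.
    Ties are resolved in favor of the earliest hour.
    """
    def mean_load(hour):
        values = hourly_load[hour].values()
        return sum(values) / len(values)

    # sorted() fixes the iteration order so that ties pick the earliest hour.
    return min(sorted(hourly_load), key=mean_load)
```

The preset information from the CMDB, such as the importance level of the application system, could further restrict the candidate hours; that filtering is left out of this sketch.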
In summary, embodiments of the present invention provide a method and a system for locating a failed hard disk, in which the change values of the preset indexes of each hard disk before and after PR and CC are executed on a RAID, and the execution durations required to execute PR and CC on the RAID, are obtained by analyzing the array card log. Whether a hard disk has failed is determined by combining the change values of the preset indexes corresponding to the hard disk with the execution durations, so that the failed hard disk is located accurately and in time.
The embodiments in the present specification are described in a progressive manner; the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiments are substantially similar to the method embodiments and are therefore described relatively briefly; for related points, reference may be made to the corresponding descriptions of the method embodiments. The above-described system embodiments are only illustrative, wherein the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.