Procedure
Run the following command on a cluster node to obtain the physical disk information under the specified path:
zbs-chunk query disk <path>
Output example
{ 'exist': True,
'formatted': False,
'in_use': False,
'instance_id': 0}
Query Success
Output note
| Parameter | Description |
|---|---|
| exist | Whether the physical disk exists. |
| formatted | Whether the physical disk is formatted by a chunk. |
| in_use | Whether the physical disk is used by a chunk. |
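When the result needs to be checked from a script, the output can be parsed rather than read by eye. The following is a minimal sketch, assuming the command prints a Python-style dictionary followed by a status line, exactly as in the output example above. The function name parse_query_disk_output is hypothetical and not part of any ZBS tool.

```python
import ast


def parse_query_disk_output(text: str) -> dict:
    """Extract the status dictionary from `zbs-chunk query disk` output.

    Assumption: the output contains a single Python-literal dictionary,
    possibly followed by a trailing line such as "Query Success".
    """
    start = text.index("{")
    end = text.rindex("}") + 1
    # literal_eval only accepts literals, so no code from the text is executed.
    return ast.literal_eval(text[start:end])


# Sample text copied from the output example above.
sample = """{ 'exist': True,
'formatted': False,
'in_use': False,
'instance_id': 0}
Query Success"""

status = parse_query_disk_output(sample)
print(status["exist"], status["formatted"], status["in_use"])
```

The parser slices from the first opening brace to the last closing brace, so any status text printed before or after the dictionary is ignored.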
Use the disk name or serial number of the physical disk to query its health details.
Procedure
Querying by physical disk name
On the node where the physical disk is located, run the following command to query the health details of the physical disk:
zbs-node show_disk_status [-p] [/dev/]<disk_name> [--with_rawdata]
Here, disk_name is the device name of the physical disk. If the disk has been detected and marked as a high-latency disk, add the --with_rawdata option to also output the raw status data of the disk.
Operation example
The following three commands produce the same result:
zbs-node show_disk_status -p /dev/sdc
zbs-node show_disk_status /dev/sdc
zbs-node show_disk_status sdc
Querying by physical disk serial number
On any node in the cluster, run the following command to query the health details of the physical disk:
zbs-node show_disk_status -s <disk_serial>
Here, disk_serial indicates the serial number of the physical disk.
Operation example
zbs-node show_disk_status -s 9XG6GTFN
Output example
== Base Information ==
is healthy : True
device name : /dev/sdc
bus type : ata
model : ST91000640NS
firmware : SN03
disk serial : 9XG6GTFN
last belong to : 10.0.67.212
== Fault Detection ==
chunk errflag detected : False
chunk warnflag detected : False
chunk io error count overflow : False
chunk checksum error count overflow : False
iostat latency detected : False
smart error detected : False
software raid faulty detected : False
offline due to io timeout : False
offline due to cmd abort : False
offline due to error queue : False
reallocated sectors count overflow : False
== Extra Fault Detection ==
chunk io errors count : -
chunk checksum errors count : -
io latency (ms) : -
iops : -
sectors per second : -
bandwidth (MiB/s) : -
smartctl hang process : -
S.M.A.R.T. assessment error : -
reallocated sectors count : -
== S.M.A.R.T. Attributes ==
ID# ATTRIBUTE_NAME VALUE THRESH RAW CHECK_FIELD CHECK_THRESH CHECK_RES
5 Reallocated_Sector_Ct 100 036 1 raw 10 True
187 Reported_Uncorrect 100 000 0 raw 0 True
188 Command_Timeout 100 000 0 value 10 True
194 Temperature_Celsius 018 000 18 raw 45 True
197 Current_Pending_Sector 100 000 0 raw 0 True
198 Offline_Uncorrectable 100 000 0 raw 0 True
Output note
| Parameter | Description |
|---|---|
| is healthy | Whether the physical disk is healthy. |
| bus type | The bus interface type, e.g., ata, scsi. |
| model | The physical disk model. |
| firmware | The firmware version. |
| disk path | The physical disk path. |
| disk serial | The physical disk serial number. |
| trace id | The identifier that the system uses to track the physical disk status (typically the serial number, but it can also be the NGUID or other information). |
| controller | The driver type used by the physical disk controller. |
| last belong to | The IP address of the last chunk to which the disk belonged. |
| chunk errflag detected | Whether the physical disk is detected to be in an error state. |
| chunk warnflag detected | Whether the physical disk is detected to be in a sub-healthy state. |
| chunk io error count overflow | Whether an I/O error is detected (the I/O error count exceeds the threshold). |
| chunk checksum error count overflow | Whether checksum errors are detected by LSM. |
| iostat latency detected | Whether a slow-disk anomaly is detected through the iostat output. |
| smart error detected | Whether an error is detected through the S.M.A.R.T. information. |
| software raid faulty detected | Whether a software RAID failure is detected. |
| offline due to io timeout | Whether the disk went offline due to an I/O timeout. |
| offline due to cmd abort | Whether the disk went offline due to a command abort. |
| offline due to error queue | Whether the disk went offline due to an error queue. |
| reallocated sectors count overflow | Whether the reallocated sectors count exceeds the threshold. |
| Reallocated_Sector_Ct | The number of reallocated sectors, which roughly corresponds to the number of failed sectors. |
| Reported_Uncorrect | The number of errors that could not be corrected by hardware ECC. |
| Command_Timeout | The number of communication timeout errors, indicating failures to connect to the hard disk. |
| Temperature_Celsius | The disk temperature in degrees Celsius. |
| Current_Pending_Sector | The current pending sector count, indicating the number of unstable sectors. |
| Offline_Uncorrectable | The count of offline uncorrectable sectors. |
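To check many disks at once, the report can be turned into a dictionary and filtered for fault flags. The following is a minimal sketch, assuming the report keeps the layout shown in the output example: section headings wrapped in "==" and one "name : value" pair per line. The function names parse_disk_status and detected_faults are hypothetical, and the S.M.A.R.T. attribute rows are deliberately skipped because they are not "name : value" pairs.

```python
def parse_disk_status(text: str) -> dict:
    """Group 'name : value' lines under their '== Section ==' heading."""
    sections = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if len(line) > 4 and line.startswith("==") and line.endswith("=="):
            # "== Fault Detection ==" becomes "Fault Detection".
            current = line.strip("= ")
            sections[current] = {}
        elif current is not None and " : " in line:
            name, value = line.split(" : ", 1)
            sections[current][name.strip()] = value.strip()
    return sections


def detected_faults(sections: dict) -> list:
    """Return the names of all fault checks reported as True."""
    faults = sections.get("Fault Detection", {})
    return [name for name, value in faults.items() if value == "True"]


# Abbreviated sample copied from the output example above.
sample = """== Base Information ==
is healthy : True
device name : /dev/sdc
disk serial : 9XG6GTFN
== Fault Detection ==
chunk errflag detected : False
smart error detected : False
offline due to io timeout : False
"""

report = parse_disk_status(sample)
print(report["Base Information"]["is healthy"])
print(detected_faults(report))
```

Values are kept as strings on purpose, because fields in the Extra Fault Detection section can contain "-" instead of a number.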
After resolving a physical disk failure, reset the health status of the physical disk. Otherwise, the disk cannot be mounted and used again.
Procedure
On the node where the physical disk is located, run the following command to reset the health status of the physical disk:
zbs-node set_disk_healthy [/dev/]<disk_name>
Here, disk_name indicates the disk name of the physical disk.
Output example
2024-06-18 14:07:43,908 node.py 769 [13302] [INFO] set chunk partition healthy.
2024-06-18 14:07:45,591 node.py 781 [13302] [INFO] set tuna disk record healthy.
2024-06-18 14:07:46,770 node.py 787 [13302] [INFO] clean from slow disk record.
2024-06-18 14:07:46,800 node.py 793 [13302] [INFO] Setting disk sdc healthy succeed.
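If the reset is run from a script, the captured log can be checked for the final success message. The following is a minimal sketch, assuming the last log line has the same wording as in the output example above; the message format may differ between versions, and the function name reset_succeeded is hypothetical.

```python
def reset_succeeded(log_text: str, disk_name: str) -> bool:
    """Return True if the log reports a successful reset for the disk.

    Assumption: success is logged as "Setting disk <name> healthy succeed."
    with the device name given without the /dev/ prefix.
    """
    prefix = "/dev/"
    name = disk_name[len(prefix):] if disk_name.startswith(prefix) else disk_name
    expected = "Setting disk {} healthy succeed.".format(name)
    return any(line.rstrip().endswith(expected) for line in log_text.splitlines())


# Sample lines copied from the output example above.
log = """2024-06-18 14:07:46,770 node.py 787 [13302] [INFO] clean from slow disk record.
2024-06-18 14:07:46,800 node.py 793 [13302] [INFO] Setting disk sdc healthy succeed.
"""

print(reset_succeeded(log, "/dev/sdc"))
```

The helper accepts the disk name with or without the /dev/ prefix, mirroring the two forms that the command itself accepts.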