Diagnose and replace a defective disk
You can check the health of a disk using its SMART (Self-Monitoring, Analysis and Reporting Technology) attributes. If the check shows that the disk is defective, you can have it replaced.
Check disk health
1. Obtain SMART attributes
The method for obtaining SMART attributes depends on the operating system installed on the server and how the disk is connected to the server:
- without a RAID controller — the disk is connected directly to the motherboard or via an HBA controller;
- via a RAID controller — the disk is connected via an Adaptec or MegaRAID controller installed on the server.
For Linux without a RAID controller:
- Install the smartmontools package, a set of utilities for monitoring the status of HDDs and SSDs that support SMART technology:
    apt-get install smartmontools
- Output the disk information:
    lsblk
  The response will contain information about the disks. Remember or copy the disk names. For example:
    NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
    sda           8:0    0   1.8T  0 disk
    └─sda1        8:1    0   1.8T  0 part /mnt/data
    sdb           8:16   0 931.5G  0 disk
    └─sdb1        8:17   0 931.5G  0 part /mnt/backup
    nvme0n1     259:0    0 465.8G  0 disk
    ├─nvme0n1p1 259:1    0   512M  0 part /boot/efi
    ├─nvme0n1p2 259:2    0    16G  0 part [SWAP]
    └─nvme0n1p3 259:3    0 449.3G  0 part /
  Here sda, sdb, and nvme0n1 are the disk names.
- Start reading SMART attributes. The command to run depends on the disk interface:
  - for SATA:
      smartctl -iA /dev/<disk_name>
    Replace <disk_name> with the disk name you copied in the previous step.
  - for NVMe (the nvme command is provided by the nvme-cli package):
      nvme smart-log /dev/<disk_name>
    Replace <disk_name> with the disk name you copied in the previous step.
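If the server has several disks, choosing the command for each one can be scripted. The following is a minimal sketch, not part of the original procedure: the function names `disk_names` and `smart_command` are hypothetical, and the script assumes that any disk whose name starts with `nvme` is an NVMe drive and that every other disk is SATA.

```shell
# List whole-disk names from default `lsblk` output read on stdin.
# TYPE is the 6th column; partitions (TYPE "part") are skipped.
disk_names() {
    awk '$6 == "disk" { print $1 }'
}

# Print the SMART command for one disk name, chosen by name prefix.
# Assumption: "nvme*" means NVMe, anything else is treated as SATA.
smart_command() {
    case $1 in
        nvme*) echo "nvme smart-log /dev/$1" ;;
        *)     echo "smartctl -iA /dev/$1" ;;
    esac
}
```

For example, `lsblk | disk_names | while read -r d; do smart_command "$d"; done` prints one command per disk. The script only prints the commands, so you can review them before running anything.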
2. Assess SMART attributes
A disk is considered defective if at least one of the SMART attributes meets the specified conditions.
The conditions depend on the disk type: HDD, SSD, or NVMe.
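To check an attribute from a script, one possible approach is to extract its raw value from `smartctl -A` output and compare it with a limit. The sketch below assumes the standard attribute table, where RAW_VALUE is the 10th column, and a raw value that is a plain integer. The function names are hypothetical, and the limit is passed in by the caller: it is an illustration, not a threshold taken from this guide.

```shell
# Print the RAW_VALUE column (10th field) of one attribute from
# `smartctl -A` output read on stdin.
smart_raw_value() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Succeed (exit status 0) if the attribute's raw value is greater than
# the limit given by the caller. Assumes the raw value is an integer;
# a missing or non-numeric value counts as "not exceeded".
smart_exceeds() {
    attr=$1
    limit=$2
    raw=$(smart_raw_value "$attr")
    [ -n "$raw" ] && [ "$raw" -gt "$limit" ] 2>/dev/null
}
```

For example, `smartctl -A /dev/sda | smart_exceeds Reallocated_Sector_Ct 0 && echo "check this disk"`, where both the attribute name and the limit of 0 are placeholders for the condition that applies to your disk type.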
Replace a defective disk
If the SMART attributes show that the disk is defective, you can initiate its replacement. To do this:
1. Obtain the serial number of the defective disk.
2. Coordinate the disk replacement.
3. If the disk is in a RAID array, remove the disk from the RAID array.
4. Illuminate the disk.
5. Check the disk in the system.
6. If the disk was in a RAID array, add the disk to the RAID array.
1. Obtain the serial number of the defective disk
For Linux without a RAID controller:
- To obtain the serial number of the defective disk, display information about the disks:
    lsblk -o name,serial,model
  The response will contain information about the disks. Copy the serial number of the defective disk. For example:
    NAME    SERIAL          MODEL
    sdb     S0H0N0XYZ123456 Samsung SSD 970 EVO Plus 500GB
    nvme0n1 S0D0NX0M001234  Samsung SSD 980 PRO 1TB
  Here SERIAL is the disk serial number.
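To avoid copying the serial number by hand, it can be picked out of the same output. This is a small sketch under two assumptions: the output has NAME in the first column and SERIAL in the second, and the disk reports a non-empty serial number. The function name `disk_serial` is hypothetical.

```shell
# Print the serial number of one disk from `lsblk -o name,serial,model`
# output read on stdin (NAME is the 1st column, SERIAL the 2nd).
# Assumption: the serial is not empty; otherwise the 2nd column
# would be the first word of the model.
disk_serial() {
    awk -v name="$1" '$1 == name { print $2 }'
}
```

For example, `lsblk -d -o name,serial,model | disk_serial sdb`. The `-d` option limits the output to whole disks, so partitions do not appear in the list.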
2. Coordinate the disk replacement
- Create a ticket. In the ticket, specify:
- If the disk replacement is approved, a Servercore employee will specify a convenient time and the duration of the work. You will need the duration to determine how long to keep the disk illuminated.
3. Remove the disk from the RAID array
If the disk is in a RAID array, remove the disk from the array.
4. Illuminate the disk
At the time designated for the work, we will inform you in the ticket that we are ready to proceed with the disk replacement.
If the disk cannot be illuminated and the engineers are unable to identify it by its serial number, then the server will need to be shut down to replace the disk. In this case, we will report the issue with disk identification and arrange a server shutdown time in the ticket.
For Linux without a RAID controller:
To illuminate the disk, that is, to make its activity indicator blink so that the engineers can find it, create a load on the disk, for example by running a read or write operation. If the disk is removed while the operation is running, read errors will occur. This is normal behavior: the command is trying to access data on a disk that has already been removed.
- Output the disk information:
    lsblk
  The response will contain information about the disks. Remember or copy the disk name. For example:
    NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
    sda           8:0    0   1.8T  0 disk
    └─sda1        8:1    0   1.8T  0 part /mnt/data
    sdb           8:16   0 931.5G  0 disk
    └─sdb1        8:17   0 931.5G  0 part /mnt/backup
    nvme0n1     259:0    0 465.8G  0 disk
    ├─nvme0n1p1 259:1    0   512M  0 part /boot/efi
    ├─nvme0n1p2 259:2    0    16G  0 part [SWAP]
    └─nvme0n1p3 259:3    0 449.3G  0 part /
  Here sda, sdb, and nvme0n1 are the disk names.
- Illuminate the disk:
    dd if=/dev/<disk_name> of=/dev/null
  Replace <disk_name> with the disk name you copied in the previous step.
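A single `dd` pass may finish before the engineers reach the server, or keep running long after the work is done. One way to tie the load to the agreed duration is to repeat the read until a time limit expires. The sketch below is an illustration only: the function name `illuminate_disk` and the 1 MiB block size are assumptions, and it relies on the `timeout` utility from GNU coreutils.

```shell
# Read the device in a loop for about the given number of seconds.
# Usage: illuminate_disk /dev/<disk_name> <seconds>
illuminate_disk() {
    dev=$1
    seconds=$2
    end=$(( $(date +%s) + seconds ))
    while :; do
        remaining=$(( end - $(date +%s) ))
        [ "$remaining" -gt 0 ] || break
        status=0
        # Read-only pass over the device; timeout stops it at the limit.
        timeout "$remaining" dd if="$dev" of=/dev/null bs=1M 2>/dev/null || status=$?
        case $status in
            0)   ;;          # full pass finished, start another one
            124) break ;;    # timeout: the time limit was reached
            *)   # Any other status is a read error, which is expected
                 # once the disk has been pulled out of the server.
                 echo "reading $dev stopped with status $status" >&2
                 return "$status" ;;
        esac
    done
    return 0
}
```

For example, `illuminate_disk /dev/sdb 1800` keeps reading from the disk for about 30 minutes. The function only reads from the device, so it does not modify the data on it.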
5. Check the disk in the system
For Linux without a RAID controller:
- Wait for a message from a Servercore employee in the ticket confirming that the disk has been replaced.
- Make sure the disk has initialized in the system:
    lsblk
- If the disk is missing from the list, restart the server. If the disk has not initialized in the system after the restart, inform us in the ticket.
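Because the replacement disk has a different serial number, one way to confirm that it is visible is to compare the serial numbers before and after the replacement. This is a minimal sketch, assuming you saved the output of `lsblk -d -n -o serial` to a file before the work started. The function name `new_serials` and the file names in the example are hypothetical.

```shell
# Print serial numbers that appear in the "after" file but not in the
# "before" file. Both files hold one serial number per line.
# Usage: new_serials <before_file> <after_file>
new_serials() {
    # -F: fixed strings, -x: whole-line match, -v: keep non-matching lines
    grep -F -x -v -f "$1" "$2"
}
```

For example, after the replacement run `lsblk -d -n -o serial > after.txt` and then `new_serials before.txt after.txt`. If nothing is printed, the system does not see a disk with a new serial number.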
6. Add the disk to the RAID array
If the disk was in a RAID array, add the replaced disk to the array.