Diagnose and replace a defective disk

You can check the status of a disk using SMART (Self-Monitoring, Analysis and Reporting Technology) attributes. If the check shows that the disk is faulty, you can have it replaced.

Check disk health

1. Obtain SMART attributes.
2. Assess SMART attribute values.

1. Obtain SMART attributes

The method for obtaining SMART attributes depends on the operating system installed on the server and how the disk is connected to the server:

without a RAID controller — the disk is connected directly to the motherboard or via an HBA controller;
via a RAID controller — the disk is connected via an Adaptec or MegaRAID controller installed on the server.

The steps below are for Linux servers where the disk is connected without a RAID controller.
Connect to the server via SSH or via KVM console.
Install the smartmontools package, a set of utilities for monitoring the status of HDDs and SSDs that support SMART technology. On Debian-based distributions:
```
apt-get install smartmontools
```

Output the disk information:

```
lsblk
```

The response will contain information about the disks. Remember or copy the disk names. For example:

```
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda           8:0    0   1.8T  0 disk
└─sda1        8:1    0   1.8T  0 part /mnt/data
sdb           8:16   0 931.5G  0 disk
└─sdb1        8:17   0 931.5G  0 part /mnt/backup
nvme0n1     259:0    0 465.8G  0 disk
├─nvme0n1p1 259:1    0   512M  0 part /boot/efi
├─nvme0n1p2 259:2    0    16G  0 part [SWAP]
└─nvme0n1p3 259:3    0 449.3G  0 part /
```

Here sda, sdb, nvme0n1 are the disk names.
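If the server has many partitions, the list can be narrowed to whole disks only. The following is a minimal sketch, not part of the original procedure: `list_disks` is a hypothetical helper name, and it assumes the input comes from `lsblk -dn -o NAME,TYPE`.

```shell
# Print only whole-disk names from `lsblk -dn -o NAME,TYPE` output read on stdin.
# Lines whose TYPE is not "disk" (loop devices, optical drives) are skipped.
list_disks() {
    awk '$2 == "disk" { print $1 }'
}

# Usage on a live server:
#   lsblk -dn -o NAME,TYPE | list_disks
```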

Start reading SMART attributes. The command to run depends on the disk interface:
- for SATA:
```
smartctl -iA /dev/<disk_name>
```
Specify <disk_name>: the disk name you copied from the lsblk output.
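- for NVMe:
```
nvme smart-log /dev/<disk_name>
```
Specify <disk_name>: the disk name you copied from the lsblk output. The nvme utility is provided by the nvme-cli package; install it if the command is not found.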
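The choice between the two commands can be made from the disk name alone. This is a sketch under one assumption: disks whose names start with `nvme` are NVMe drives and all others are treated as SATA. `smart_command` is a hypothetical helper that only prints the command and does not run it.

```shell
# Print the SMART command that matches the disk name passed as $1.
# Assumption: names starting with "nvme" are NVMe drives, all others are SATA.
smart_command() {
    case "$1" in
        nvme*) echo "nvme smart-log /dev/$1" ;;
        *)     echo "smartctl -iA /dev/$1" ;;
    esac
}

# Usage:
#   smart_command sda
#   smart_command nvme0n1
```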

2. Assess SMART attributes

A disk is considered defective if at least one of the SMART attributes meets the specified conditions.

HDDs

| Attribute | Attribute description | Field | Attribute value |
|---|---|---|---|
| 5 Reallocated_Sector_Ct | Number of sectors reallocated due to errors | RAW_VALUE | > 0 |
| 7 Seek_Error_Rate | Error rate during positioning of the head assembly | VALUE | < 45 |
| 9 Power_On_Hours | Power-on hours | RAW_VALUE | > 43800 |
| 10 Spin_Retry_Count | Number of retries to spin up the disk to operating speed if the first attempt failed | RAW_VALUE | > 10 |
| 197 Current_Pending_Sector | Number of sectors waiting to be reallocated | RAW_VALUE | > 0 |
| 198 Offline_Uncorrectable | Number of sectors that the disk controller could not correct by itself | RAW_VALUE | > 0 |

SSDs

| Attribute | Attribute description | Field | Attribute value |
|---|---|---|---|
| 175 Power Loss Protection Failure | Status of the mechanism that prevents data corruption during a sudden power loss | VALUE | < 10 |
| 184 End-to-End Error Detection Count | Number of errors detected during data transfer between the RAID controller and memory and back | RAW_VALUE | > 9 |
| 231 Life Left (SSDs) or Temperature | Percentage of SSD service life remaining | VALUE | < 11 |
| 232 Available Reserved Space | Remaining share of spare blocks used to replace damaged memory cells | VALUE | < 11 |
| 233 Media Wearout Indicator | SSD flash memory wear level | VALUE | < 11 |

NVMe drives

| Attribute | Attribute description | Attribute value |
|---|---|---|
| Available Spare | Percentage of spare memory cells that can be used instead of those that have failed | < 11 |
| Percentage Used | Memory wear level | > 105 |
| Media and Data Integrity Errors | Number of write or read errors on the media | > 0 |
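The HDD and NVMe conditions above can be checked automatically. The sketch below is illustrative only and rests on assumptions: `check_hdd_attrs` and `check_nvme_log` are hypothetical helper names; the HDD check assumes the usual `smartctl -A` column layout (VALUE in column 4, RAW_VALUE in column 10); the NVMe check assumes that `nvme smart-log` prints the fields as `available_spare`, `percentage_used` and `media_errors`, which may differ between nvme-cli versions. SSD attributes are not covered.

```shell
# Read `smartctl -A` output on stdin and print the HDD attributes
# that meet the conditions in the HDD table.
check_hdd_attrs() {
    awk '
        $1 !~ /^[0-9]+$/ { next }   # skip header and blank lines
        { id = $1 + 0; value = $4 + 0; raw = $10 + 0 }
        (id == 5 || id == 197 || id == 198) && raw > 0 { print id, $2 }
        id == 7  && value < 45  { print id, $2 }
        id == 9  && raw > 43800 { print id, $2 }
        id == 10 && raw > 10    { print id, $2 }
    '
}

# Read `nvme smart-log` output on stdin and print the fields
# that meet the conditions in the NVMe table.
check_nvme_log() {
    awk -F':' '
        {
            key = $1; val = $2
            gsub(/[ \t]+/, "", key)   # drop padding around the field name
            gsub(/[^0-9]/, "", val)   # keep digits only ("5%" becomes "5")
            val += 0
        }
        key == "available_spare" && val < 11  { print key }
        key == "percentage_used" && val > 105 { print key }
        key == "media_errors"    && val > 0   { print key }
    '
}

# Usage on a live server:
#   smartctl -A /dev/sda | check_hdd_attrs
#   nvme smart-log /dev/nvme0n1 | check_nvme_log
```

Empty output means that none of the checked conditions were met.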

Replace a defective disk

If the assessment of SMART attributes shows that the disk is defective, you can request its replacement. To do this:

1. Obtain the serial number of the defective disk.
2. Coordinate the disk replacement.
3. If the disk is part of a RAID array, remove the disk from the RAID array.
4. Illuminate the disk.
5. Check the disk in the system.
6. If the disk was in a RAID array, add the new disk to the RAID array.

1. Obtain the serial number of the defective disk

The steps below are for Linux servers where the disk is connected without a RAID controller.

Connect to the server via SSH or via KVM console.
Display the disk information, including serial numbers:
```
lsblk -o name,serial,model
```
The response will contain information about the disks. Copy the serial number of the defective disk. For example:
```
NAME    SERIAL            MODEL
sdb     S0H0N0XYZ123456   Samsung SSD 970 EVO Plus 500GB
nvme0n1 S0D0NX0M001234    Samsung SSD 980 PRO 1TB
```
Here SERIAL is the disk serial number.
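To pick out a single serial number without reading the whole table, the output can be filtered by disk name. A minimal sketch, assuming the input comes from `lsblk -dn -o NAME,SERIAL`; `serial_of` is a hypothetical helper name.

```shell
# Print the serial number of the disk named $1,
# reading `lsblk -dn -o NAME,SERIAL` output on stdin.
serial_of() {
    awk -v name="$1" '$1 == name { print $2 }'
}

# Usage on a live server:
#   lsblk -dn -o NAME,SERIAL | serial_of sdb
```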

2. Coordinate the disk replacement

Create a ticket. In the ticket, specify:
- the obtained SMART attributes;
- the serial number of the defective disk.
If the disk replacement is approved, a Servercore employee will propose a convenient time and tell you how long the work will take. You will need the duration to decide how long to keep the disk illuminated.

3. Remove the disk from the RAID array

If the disk is in a RAID array, remove the disk from the array.
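The document does not say how the array is managed. As an assumption for illustration only, if it is a Linux software RAID managed with mdadm, removal usually means marking the member as failed and then removing it. The sketch below only prints the commands so that they can be reviewed first; `md_remove_commands`, the array name and the member name are hypothetical.

```shell
# Print (do not run) the mdadm commands that mark a member as failed
# and then remove it from a software RAID array.
# $1 = array name (for example md0), $2 = member device (for example sdb1)
md_remove_commands() {
    echo "mdadm --manage /dev/$1 --fail /dev/$2"
    echo "mdadm --manage /dev/$1 --remove /dev/$2"
}

# Usage:
#   md_remove_commands md0 sdb1
```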

4. Illuminate the disk

At the time designated for the work, we will inform you in the ticket that we are ready to proceed with the disk replacement.

If the disk cannot be lit up and the engineers cannot identify it by its serial number, the server will need to be powered off to replace the disk. In this case, we will notify you of the disk identification problem and coordinate the server power-off time in a ticket.

The steps below are for Linux servers where the disk is connected without a RAID controller.

To light up a disk, put a sustained load on it, for example by running a continuous read or write operation, so that its activity indicator stays lit and engineers can locate it. If the disk is removed while the operation is running, the command will report read errors. This is expected, because the command is trying to access a disk that is no longer there.

Connect to the server via SSH or via KVM console.

Output the disk information:

```
lsblk
```

The response will contain information about the disks. Remember or copy the disk name. For example:

```
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda           8:0    0   1.8T  0 disk
└─sda1        8:1    0   1.8T  0 part /mnt/data
sdb           8:16   0 931.5G  0 disk
└─sdb1        8:17   0 931.5G  0 part /mnt/backup
nvme0n1     259:0    0 465.8G  0 disk
├─nvme0n1p1 259:1    0   512M  0 part /boot/efi
├─nvme0n1p2 259:2    0    16G  0 part [SWAP]
└─nvme0n1p3 259:3    0 449.3G  0 part /
```

Here sda, sdb, nvme0n1 are the disk names.

Illuminate the disk:
```
dd if=/dev/<disk_name> of=/dev/null
```
Specify <disk_name>: the disk name you copied from the lsblk output.
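Because the work has an agreed duration, it may be convenient to have the read stop by itself. The following sketch wraps the same `dd` read in the coreutils `timeout` utility and prints the resulting command for review; `light_command` and its two parameters are hypothetical, and `bs=1M` is an added assumption to read in larger blocks. On a small disk the read may finish before the time limit is reached.

```shell
# Print (do not run) a read command that stops after a set time.
# $1 = disk name, $2 = duration in minutes
light_command() {
    echo "timeout ${2}m dd if=/dev/$1 of=/dev/null bs=1M"
}

# Usage:
#   light_command sdb 30
```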

5. Check the disk in the system

The steps below are for Linux servers where the disk is connected without a RAID controller.

Wait for a message from a Servercore employee in the ticket confirming that the disk has been replaced.
Connect to the server via SSH or via KVM console.
Make sure the disk has initialized in the system:
```
lsblk
```
If the disk is missing from the list, restart the server. If the disk has not initialized in the system after the restart, inform us in the ticket.
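One way to confirm the swap is to check that the serial number of the defective disk is no longer listed. A minimal sketch, assuming the input comes from `lsblk -dn -o NAME,SERIAL`; `serial_present` is a hypothetical helper that exits with status 0 when the serial number is found.

```shell
# Succeed if serial number $1 appears in `lsblk -dn -o NAME,SERIAL`
# output read on stdin; fail otherwise.
serial_present() {
    awk -v s="$1" '$2 == s { found = 1 } END { exit !found }'
}

# Usage on a live server:
#   if lsblk -dn -o NAME,SERIAL | serial_present S0H0N0XYZ123456; then
#       echo "old disk is still present"
#   fi
```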

6. Add the disk to the RAID array

If the disk was in a RAID array, add the replaced disk to the array.
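As with removal, the exact commands depend on how the array is managed. Assuming, for illustration only, a Linux software RAID managed with mdadm, the sketch below prints the command that adds a member and lists arrays that `/proc/mdstat` reports as missing a member. `md_add_command` and `degraded_arrays` are hypothetical helper names. If the array members are partitions, the new disk needs a matching partition before it can be added.

```shell
# Print (do not run) the mdadm command that adds a member to an array.
# $1 = array name (for example md0), $2 = member device (for example sdb1)
md_add_command() {
    echo "mdadm --manage /dev/$1 --add /dev/$2"
}

# Read /proc/mdstat on stdin and print arrays that are missing a member:
# an underscore in the [UU]-style status marks an absent or failed device.
degraded_arrays() {
    awk '
        /^md/ { array = $1 }
        /\[[U_]*_[U_]*\]/ { print array }
    '
}

# Usage on a live server:
#   md_add_command md0 sdb1
#   degraded_arrays < /proc/mdstat
```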