Diagnose and replace a defective disk

You can check the status of a disk using its SMART (Self-Monitoring, Analysis and Reporting Technology) attributes. If the check shows that the disk is defective, you can have it replaced.

Check disk health

  1. Obtain SMART attributes.
  2. Assess SMART attribute values.

1. Obtain SMART attributes

The method for obtaining SMART attributes depends on the operating system installed on the server and how the disk is connected to the server:

  • without a RAID controller — the disk is connected directly to the motherboard or via an HBA controller;
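  • via a RAID controller — the disk is connected via an Adaptec or MegaRAID controller installed on the server.

The steps below use apt-get and address the disk directly, so they apply to a Debian-based operating system (for example, Ubuntu or Debian) and a disk connected without a RAID controller.

  1. Connect to the server via SSH or via KVM console.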

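  2. Install the smartmontools package, a set of utilities for monitoring the status of HDDs and SSDs that support SMART technology. To read SMART data from NVMe disks with the nvme command in step 4, also install the nvme-cli package.

    apt-get install smartmontools nvme-cli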
  3. Output the disk information:

    lsblk

    The response will contain information about the disks. Remember or copy the disk names. For example:

    NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
    sda 8:0 0 1.8T 0 disk
    └─sda1 8:1 0 1.8T 0 part /mnt/data
    sdb 8:16 0 931.5G 0 disk
    └─sdb1 8:17 0 931.5G 0 part /mnt/backup
    nvme0n1 259:0 0 465.8G 0 disk
    ├─nvme0n1p1 259:1 0 512M 0 part /boot/efi
    ├─nvme0n1p2 259:2 0 16G 0 part [SWAP]
    └─nvme0n1p3 259:3 0 449.3G 0 part /

    Here sda, sdb, nvme0n1 are the disk names.

  4. Start reading SMART attributes. The command to run depends on the disk interface:

    • for SATA:
    smartctl -iA /dev/<disk_name>

    Specify <disk_name> — the disk name you copied in step 3.

    • for NVMe:
    nvme smart-log /dev/<disk_name>

    Specify <disk_name> — the disk name you copied in step 3.
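
The choice between the two commands in step 4 can be sketched as a small shell helper. This is a minimal sketch, not part of the official procedure: the function names smart_command_for and print_smart_commands are hypothetical, and the helper assumes that every disk whose name does not start with nvme is a SATA disk connected without a RAID controller.

```shell
# Hypothetical helper: print the SMART command from step 4 for a disk name
# as shown by lsblk. Assumption: any name that does not start with "nvme"
# belongs to a SATA disk connected without a RAID controller.
smart_command_for() {
    case "$1" in
        nvme*) echo "nvme smart-log /dev/$1" ;;
        *)     echo "smartctl -iA /dev/$1" ;;
    esac
}

# Print the command for every whole disk on the server.
# lsblk options: -d skips partitions, -n drops the header, -o NAME prints names only.
print_smart_commands() {
    lsblk -dn -o NAME | while read -r disk; do
        smart_command_for "$disk"
    done
}
```

The helper only prints the commands so that you can review them before running them. The lsblk list may also include devices that are not physical disks, such as loop devices, so check the names first.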

2. Assess SMART attributes

A disk is considered defective if at least one of the SMART attributes meets the specified conditions.

Attribute                  | Description                                                                               | Field     | Condition
---------------------------+-------------------------------------------------------------------------------------------+-----------+----------
5 Reallocated_Sector_Ct    | Number of sectors reallocated due to errors                                               | RAW_VALUE | > 0
7 Seek_Error_Rate          | Error rate during positioning of the head assembly                                        | VALUE     | < 45
9 Power_On_Hours           | Power-on hours                                                                            | RAW_VALUE | > 43800
10 Spin_Retry_Count        | Number of retries to spin up the disk to operating speed if the first attempt failed      | RAW_VALUE | > 10
197 Current_Pending_Sector | Number of sectors waiting to be reallocated                                               | RAW_VALUE | > 0
198 Offline_Uncorrectable  | Number of sectors that the disk controller could not correct by itself                    | RAW_VALUE | > 0
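
The conditions in the table can be checked automatically. The sketch below is an illustration only: the function name check_smart_attributes is hypothetical, and it assumes the usual smartctl -A attribute layout for SATA disks, where VALUE is the fourth column and RAW_VALUE is the tenth. It does not cover the output of nvme smart-log.

```shell
# Hypothetical helper: read `smartctl -A` output on stdin and print the ID and
# name of every attribute that meets a condition from the table.
# Assumption: attribute rows have VALUE in column 4 and RAW_VALUE in column 10.
check_smart_attributes() {
    awk '
        # Attribute rows start with a numeric ID; the header and other lines are skipped.
        $1 ~ /^[0-9]+$/ && NF >= 10 {
            id = $1 + 0      # attribute ID
            value = $4 + 0   # normalized VALUE
            raw = $10 + 0    # RAW_VALUE (numeric prefix only)
            bad = 0
            if (id == 5   && raw > 0)     bad = 1
            if (id == 7   && value < 45)  bad = 1
            if (id == 9   && raw > 43800) bad = 1
            if (id == 10  && raw > 10)    bad = 1
            if (id == 197 && raw > 0)     bad = 1
            if (id == 198 && raw > 0)     bad = 1
            if (bad) print id, $2
        }
    '
}

# Example on a server (replace sda with your disk name):
#   smartctl -A /dev/sda | check_smart_attributes
```

If the helper prints nothing, none of the conditions are met. Any printed line names an attribute that marks the disk as defective according to the table.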

Replace a defective disk

To find out whether a disk has failed, check the disk health. If the SMART attributes show that the disk is defective, you can request its replacement. To do this:

  1. Obtain the serial number of the defective disk.
  2. Coordinate the disk replacement.
  3. If the disk is added to a RAID array, remove the disk from the RAID array.
  4. Illuminate the disk.
  5. Check the disk in the system.
  6. If the disk was in a RAID array, add the disk to the RAID array.

1. Obtain the serial number of the defective disk

  1. Connect to the server via SSH or via KVM console.

  2. Obtain the serial number of the defective disk; to do this, display information about the disks:

    lsblk -o name,serial,model

    The response will contain information about the disks. Copy the serial number of the defective disk. For example:

    NAME SERIAL MODEL
    sdb S0H0N0XYZ123456 Samsung SSD 970 EVO Plus 500GB
    nvme0n1 S0D0NX0M001234 Samsung SSD 980 PRO 1TB

    Here SERIAL is the disk serial number.
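
To copy the serial number of one specific disk without reading the whole table, the lookup can be sketched as follows. The function name serial_of is hypothetical. The sketch assumes that the input contains one disk per line with the name in the first column and the serial number in the second.

```shell
# Hypothetical helper: read "NAME SERIAL" pairs on stdin and print the serial
# number of the disk whose name is passed as the first argument.
serial_of() {
    awk -v disk="$1" '$1 == disk { print $2 }'
}

# Example on a server (replace sdb with the name of the defective disk):
#   lsblk -dn -o NAME,SERIAL | serial_of sdb
# lsblk options: -d skips partitions, -n drops the header row.
```

If the disk name is not in the list, the helper prints nothing.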

2. Coordinate the disk replacement

  1. Create a ticket. In the ticket, specify:

  2. If the disk replacement is approved, a Servercore employee will specify a convenient time and the duration of the work. You will need the duration to determine how long to keep the disk illuminated.

3. Remove the disk from the RAID array

If the disk is in a RAID array, remove the disk from the array.

4. Illuminate the disk

At the time designated for the work, we will inform you in the ticket that we are ready to proceed with the disk replacement.

If the disk cannot be illuminated and the engineers are unable to identify it by its serial number, then the server will need to be shut down to replace the disk. In this case, we will report the issue with disk identification and arrange a server shutdown time in the ticket.

To illuminate the disk, that is, to make its activity indicator blink so that the engineers can find it, create a load on it, for example, by running a write or read operation. If the disk is removed while these operations are being performed, read errors will occur. This is normal behavior, as the command attempts to access data on a disk that has already been removed.

  1. Connect to the server via SSH or via KVM console.

  2. Output the disk information:

    lsblk

    The response will contain information about the disks. Remember or copy the disk name. For example:

    NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
    sda 8:0 0 1.8T 0 disk
    └─sda1 8:1 0 1.8T 0 part /mnt/data
    sdb 8:16 0 931.5G 0 disk
    └─sdb1 8:17 0 931.5G 0 part /mnt/backup
    nvme0n1 259:0 0 465.8G 0 disk
    ├─nvme0n1p1 259:1 0 512M 0 part /boot/efi
    ├─nvme0n1p2 259:2 0 16G 0 part [SWAP]
    └─nvme0n1p3 259:3 0 449.3G 0 part /

    Here sda, sdb, nvme0n1 are the disk names.

  3. Illuminate the disk:

    dd if=/dev/<disk_name> of=/dev/null

    Specify <disk_name> — the disk name you copied in step 2.
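
A single dd run stops when it reaches the end of the disk, which may not match the agreed duration of the work. The sketch below keeps a read load on the disk for a fixed number of seconds. It is an illustration under assumptions: the function name illuminate_disk is hypothetical, and it relies on the timeout utility from GNU coreutils being installed.

```shell
# Hypothetical helper: read from a disk for a fixed number of seconds so that
# its activity indicator stays lit. Only reads are performed; no data is written.
# Usage: illuminate_disk /dev/<disk_name> <seconds>
illuminate_disk() {
    dev=$1
    seconds=$2
    end=$(( $(date +%s) + seconds ))
    while :; do
        remaining=$(( end - $(date +%s) ))
        [ "$remaining" -gt 0 ] || break
        # timeout stops dd when the remaining time runs out (exit status 124).
        timeout "$remaining" dd if="$dev" of=/dev/null bs=1M 2>/dev/null \
            && status=0 || status=$?
        # 0 means dd reached the end of the disk, so the next pass starts over.
        # Any other status (for example, a read error after the disk is pulled)
        # stops the loop and is returned to the caller.
        [ "$status" -eq 0 ] || [ "$status" -eq 124 ] || return "$status"
    done
}

# Example: keep the disk busy for 30 minutes.
#   illuminate_disk /dev/sdb 1800
```

When the engineers remove the disk, dd fails with a read error and the helper returns a non-zero status, which matches the expected behavior described above.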

5. Check the disk in the system

  1. Wait for a message from a Servercore employee in the ticket confirming that the disk has been replaced.

  2. Connect to the server via SSH or via KVM console.

  3. Make sure the disk has initialized in the system:

    lsblk
  4. If the disk is missing from the list, restart the server. If the disk has not initialized in the system after the restart, inform us in the ticket.

6. Add the disk to the RAID array

If the disk was in a RAID array, add the replaced disk to the array.