Diagnose and replace the defective disk

You can check the condition of a disk using SMART (Self-Monitoring, Analysis and Reporting Technology) attributes. If the results show that the disk is faulty, you can have the defective disk replaced.

Check disk condition

  1. Get SMART attributes.
  2. Evaluate the values of the SMART attributes.

1. Get SMART attributes

The method of obtaining SMART attributes depends on the operating system installed on the server and the way the disk is connected to the server:

  • without a RAID controller: the disk is connected directly to the motherboard or through an HBA controller;
  • via a RAID controller: the disk is connected via an Adaptec or MegaRAID controller installed on the server.

  1. Connect to the server via SSH or through KVM console.

  2. Install the smartmontools package, a set of utilities for monitoring the state of HDDs and SSDs that support SMART technology:

    apt-get install smartmontools

  3. Output information about the disks connected to the server:

    lsblk

    Disk information will appear in the response. Memorize or copy the disk IDs. For example:

    NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
    sda 8:0 0 1.8T 0 disk
    └─sda1 8:1 0 1.8T 0 part /mnt/data
    sdb 8:16 0 931.5G 0 disk
    └─sdb1 8:17 0 931.5G 0 part /mnt/backup
    nvme0n1 259:0 0 465.8G 0 disk
    ├─nvme0n1p1 259:1 0 512M 0 part /boot/efi
    ├─nvme0n1p2 259:2 0 16G 0 part [SWAP]
    └─nvme0n1p3 259:3 0 449.3G 0 part /

    Here, sda, sdb, and nvme0n1 are the disk IDs.

  4. Read the SMART attributes. The command to run depends on the disk interface:

    • for SATA:
    smartctl -iA /dev/<disk_id>

    • for NVMe:
    nvme smart-log /dev/<disk_id>

    Replace <disk_id> with the ID of the disk you copied in step 3.
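
Steps 3 and 4 can be combined into a small helper. The sketch below is only an illustration, not part of the official procedure: the function name smart_commands is hypothetical, and it assumes that NVMe disk IDs start with "nvme" and that every other disk is read with smartctl. The nvme command comes from the nvme-cli package, which may need to be installed separately.

```shell
#!/bin/sh
# Read default `lsblk` output on stdin and print, for every whole disk,
# the SMART command from step 4 that matches its interface.
# Assumption: IDs starting with "nvme" are NVMe disks; all other disks
# are treated as SATA. Column 6 of default lsblk output is TYPE.
smart_commands() {
    awk '$6 == "disk" {
        if ($1 ~ /^nvme/)
            print "nvme smart-log /dev/" $1
        else
            print "smartctl -iA /dev/" $1
    }'
}
```

For example, `lsblk | smart_commands` prints one command per disk, which you can review and then run as root.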

2. Evaluate the values of the SMART attributes

A disk is considered faulty if at least one of its SMART attributes meets the condition listed for it in the table.

Attribute | Description | Field | Attribute value
5 Reallocated_Sector_Ct | Number of sectors reassigned due to errors | RAW_VALUE | > 0
7 Seek_Error_Rate | Error rate for positioning the head unit | VALUE | < 45
9 Power_On_Hours | Hours worked | RAW_VALUE | > 43800
10 Spin_Retry_Count | Number of repeated attempts to spin up disks to operating speed in case the first attempt was unsuccessful | RAW_VALUE | > 10
197 Current_Pending_Sector | Number of sectors in the reassignment queue | RAW_VALUE | > 0
198 Offline_Uncorrectable | Number of sectors on the disk that the disk controller tried to fix on its own | RAW_VALUE | > 0
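
The conditions in the table can be checked automatically for disks read with smartctl. The following is a rough sketch, not an official tool: the function name check_smart is hypothetical, and it assumes the usual column layout of `smartctl -A` output, where the attribute ID is the first column, VALUE is the fourth, and RAW_VALUE is the tenth. It does not apply to `nvme smart-log` output.

```shell
#!/bin/sh
# Read `smartctl -A` output on stdin, print a FAIL line for every
# attribute that meets a condition from the table, and return 1 if any
# attribute failed (0 otherwise).
# Assumption: columns are ID, NAME, FLAG, VALUE, WORST, THRESH, TYPE,
# UPDATED, WHEN_FAILED, RAW_VALUE.
check_smart() {
    awk '
        # Attribute rows start with a numeric ID; skip headers and blanks.
        $1 !~ /^[0-9]+$/ { next }
        {
            id = $1 + 0; name = $2; value = $4 + 0; raw = $10 + 0
            bad = 0
            if (id == 5   && raw > 0)     bad = 1
            if (id == 7   && value < 45)  bad = 1
            if (id == 9   && raw > 43800) bad = 1
            if (id == 10  && raw > 10)    bad = 1
            if (id == 197 && raw > 0)     bad = 1
            if (id == 198 && raw > 0)     bad = 1
            if (bad) { print "FAIL " id " " name; failed = 1 }
        }
        END { exit failed + 0 }
    '
}
```

A possible invocation is `smartctl -A /dev/<disk_id> | check_smart`. A non-zero exit status means that at least one condition from the table was met.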

Replace a defective disk

A disk malfunction is identified by checking the disk condition. If the evaluation of the SMART attributes shows that the disk is defective, you can initiate a replacement. To do so:

  1. Get the serial number of the defective disk.
  2. Coordinate disk replacement.
  3. Remove the disk from the RAID array.
  4. Illuminate the disk.
  5. Check the disk in the system.
  6. Add a disk to a RAID array.

1. Get the serial number of the defective disk

  1. Connect to the server via SSH or through KVM console.

  2. To get the serial number of the faulty disk, print the disk information:

    lsblk -o name,serial,model

    Disk information will appear in the response. Copy the serial number of the failed disk. For example:

    NAME    SERIAL            MODEL
    sdb S0H0N0XYZ123456 Samsung SSD 970 EVO Plus 500GB
    nvme0n1 S0D0NX0M001234 Samsung SSD 980 PRO 1TB

    Here, SERIAL is the serial number of the disk.
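
If you want to copy the serial number without reading the table by eye, the output can be filtered by disk ID. This is a minimal sketch with a hypothetical function name, disk_serial. It assumes the SERIAL column is not empty for the disk in question; if it is empty, the second field would be the start of the model name instead.

```shell
#!/bin/sh
# Read `lsblk -o name,serial,model` output on stdin and print the serial
# number (second column) of the disk whose ID is passed as $1.
disk_serial() {
    awk -v disk="$1" '$1 == disk { print $2 }'
}
```

For example, `lsblk -d -o name,serial,model | disk_serial sdb` prints the serial number of sdb. The -d option limits the output to whole disks.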

2. Coordinate disk replacement

  1. Create a ticket. In the ticket, specify:

  2. If the disk replacement is agreed upon, a Servercore staff member will confirm a time that is convenient for you and the duration of the work. The duration is needed to determine how long the disk must stay illuminated.

3. Remove the disk from the RAID array

If the disk is in a RAID array, remove the disk from the array.
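
This page does not say which RAID tooling is in use. As one illustration, assuming a Linux software RAID array managed with mdadm, removal usually means marking the member as failed and then removing it from the array. The helper name md_remove_commands and the names /dev/md0 and /dev/sdb1 are placeholders, not values from this guide. The helper only prints the commands so that you can review them before running them as root. For a hardware RAID controller, use the controller's own utility instead.

```shell
#!/bin/sh
# Print the mdadm commands that mark a member as failed and remove it
# from a software RAID array. Nothing is executed.
# $1 = array device (placeholder, e.g. /dev/md0)
# $2 = member device (placeholder, e.g. /dev/sdb1)
md_remove_commands() {
    array=$1
    member=$2
    printf 'mdadm --manage %s --fail %s\n' "$array" "$member"
    printf 'mdadm --manage %s --remove %s\n' "$array" "$member"
}
```

For example, `md_remove_commands /dev/md0 /dev/sdb1` prints the two commands in the order they should be run.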

4. Illuminate the disk

At the time scheduled for the work, we will notify you in a ticket that we are ready to proceed with the disk replacement.

If the disk fails to illuminate and the engineers cannot identify it by serial number, we will need to shut down the server to replace the disk. In this case, we will report in the ticket that the disk could not be identified and agree with you on a time to shut down the server.

To illuminate a disk, put a load on it, such as a read or write operation. If the disk is ejected while these operations are in progress, read errors will appear. This is normal behavior because the command is trying to access data on a disk that has already been ejected.

  1. Connect to the server via SSH or through KVM console.

  2. Output information about the disks connected to the server:

    lsblk

    Disk information will appear in the response. Memorize or copy the disk ID. For example:

    NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
    sda 8:0 0 1.8T 0 disk
    └─sda1 8:1 0 1.8T 0 part /mnt/data
    sdb 8:16 0 931.5G 0 disk
    └─sdb1 8:17 0 931.5G 0 part /mnt/backup
    nvme0n1 259:0 0 465.8G 0 disk
    ├─nvme0n1p1 259:1 0 512M 0 part /boot/efi
    ├─nvme0n1p2 259:2 0 16G 0 part [SWAP]
    └─nvme0n1p3 259:3 0 449.3G 0 part /

    Here, sda, sdb, and nvme0n1 are the disk IDs.

  3. Light up the disk:

    dd if=/dev/<disk_id> of=/dev/null

    Specify <disk_id> — ID of the disk you copied in step 2.
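
A single dd pass ends once the whole disk has been read, which may be sooner than the agreed duration of the work. One way to keep the load going for a fixed time is to repeat the read under timeout. This is a sketch under assumptions, not the provider's procedure: the function name illuminate_disk and the block size bs=1M are illustrative choices, and timeout is the GNU coreutils utility.

```shell
#!/bin/sh
# Keep reading a disk for a fixed number of seconds.
# $1 = device path, e.g. /dev/<disk_id>
# $2 = duration in seconds
illuminate_disk() {
    dev=$1
    seconds=$2
    end=$(( $(date +%s) + seconds ))
    passes=0
    while :; do
        remaining=$(( end - $(date +%s) ))
        [ "$remaining" -gt 0 ] || break
        rc=0
        # Read the device and discard the data; timeout stops dd when
        # the remaining time runs out (exit status 124).
        timeout "$remaining" dd if="$dev" of=/dev/null bs=1M 2>/dev/null || rc=$?
        if [ "$rc" -ne 0 ] && [ "$rc" -ne 124 ]; then
            # dd itself failed, for example because the disk was ejected.
            echo "dd failed on $dev (exit $rc)" >&2
            return "$rc"
        fi
        passes=$((passes + 1))
    done
    echo "finished after $passes pass(es)"
}
```

For example, `illuminate_disk /dev/<disk_id> 1800` keeps the read load running for 30 minutes. A read error after the engineers eject the disk is expected, as described above.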

5. Check the disk in the system

  1. Wait for a message on the ticket that the disk has been replaced.

  2. Connect to the server via SSH or through KVM console.

  3. Verify that the system has detected the new disk:

    lsblk

  4. If the disk is not in the list, reboot the server. If the disk still does not appear after the reboot, report it in the ticket.
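
The check in step 3 can also be scripted. The sketch below uses a hypothetical helper, disk_present, and assumes the replacement disk gets the same disk ID as the old one, which is not guaranteed.

```shell
#!/bin/sh
# Read a list of disk IDs (one per line) on stdin and succeed only if
# the ID passed as $1 is in the list.
disk_present() {
    grep -qxF -- "$1"
}
```

For example, `lsblk -d -n -o NAME | disk_present sdb && echo "sdb detected"` prints a message only when sdb is listed. The -d option limits lsblk to whole disks and -n suppresses the header line.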

6. Add a disk to a RAID array

If the disk was in a RAID array, add the replaced disk to the array.
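
As with removal, the exact command depends on the RAID tooling. Assuming a Linux software RAID array managed with mdadm, and assuming the replacement disk has already been partitioned to match the old member, adding it back might look like the commands printed by the hypothetical helper below. The names /dev/md0 and /dev/sdb1 are placeholders.

```shell
#!/bin/sh
# Print the commands that add a member to a software RAID array and
# then show the rebuild progress. Nothing is executed.
# $1 = array device (placeholder, e.g. /dev/md0)
# $2 = member device (placeholder, e.g. /dev/sdb1)
md_add_commands() {
    array=$1
    member=$2
    printf 'mdadm --manage %s --add %s\n' "$array" "$member"
    printf 'cat /proc/mdstat\n'
}
```

For example, `md_add_commands /dev/md0 /dev/sdb1` prints the add command followed by the command that shows the array status.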