Battle testing ZFS, Btrfs and mdadm+dm-integrity
Published on 2019-05-05. Modified on 2020-01-23.
In this article I share the results of a home-lab experiment in which I threw some different problems at ZFS, Btrfs and mdadm+dm-integrity in a RAID-5 setup.
Table of Contents
- Introduction
- Myths and misunderstandings
- Some advice
- ZFS RAID-Z
- Btrfs RAID-5
- mdadm+dm-integrity RAID-5
- Final notes
- Relevant reading
Introduction
Let me start by saying that this is a simple write-up; it wasn't originally intended to be anything but personal notes, which I then decided to share.
I did my tests on and off during the course of about a week, but I have tried to be consistent. I have repeated many of the tests more than once, but did not document everything as the results were very similar.
My main interest was to see how the different systems would handle multiple breakdown situations in a RAID-5 setup. I also tested mirror (RAID-1) setups, but due to the length of the article I later decided not to include those.
I have used the word "pool" from the world of ZFS and Btrfs whenever I am dealing with the RAID-5 array.
Please forgive any shortcomings, missing parts, and mistakes in this write-up. Also, English is not my native language.
Now, on to the subject.
Whenever we use any kind of precautionary measures against data corruption, such as backup and/or filesystem data integrity verification, we need to test our setup with at least some simulated failures before we implement a solution. If we never "battle test" our solution, we have no real idea how it's going to handle a breakdown.
We need to ask questions like:
- If my system breaks down right now do I have adequate measures in place or will I lose important data?
- If I do have backup in place, will my backup suffice? Is it recent enough? Is it secure enough?
- What if my backup solution breaks during restoration? Do I need multiple backup solutions?
- What about bit rot?
- Do I need running data integrity verification?
- Do I need to backup everything or can I perhaps split data into important and non-important categories?
- Do I need to automate some of the procedures?
- Have I tested my solution?
- Have I tested my solution?
- Have I tested my solution?
Yes, we really need to test our solutions thoroughly :)
In any case, ZFS and Btrfs are both amazing open source filesystems with built-in data integrity verification.
I have more experience using ZFS, and the last time I tested Btrfs it was not performing well. File transfers were slow, and a situation occurred where I lost some files. However, that was a very long time ago. I have since looked through the Btrfs source code and commit logs, and Btrfs has received many fixes and improvements - especially during the last couple of years.
I therefore decided to put up a simple home-test environment on bare metal, throw some simulated problems at both ZFS and Btrfs, and then try to deal with them in an as-identical-as-possible manner on both systems. Later I added mdadm+dm-integrity.
I managed to get along with all the systems using only their respective man pages, even though I think the Btrfs documentation could benefit a lot from some examples.
I used old and cheap hardware suitable for a home-lab.
The computer I used has only 4 SATA-II connectors and I decided to use one for the boot device itself. I then used the rest for a RAID-5 (RAID-Z in ZFS) with just three hard drives. I could have booted from a USB stick and then used four drives, but I wanted to speed up both the installation time and boot time.
A RAID-5 requires 3 or more physical drives. RAID-5 stores parity blocks distributed across each disk. In the event of a failed disk, these parity blocks are used to reconstruct the data on a replacement disk. RAID-5 can withstand the loss of one disk.
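The parity arithmetic behind this can be sketched in a few lines. This is just the XOR idea, not real RAID code, and the block contents are made up for illustration:

```python
# Toy RAID-5 stripe: two data blocks plus one XOR parity block.
# Real RAID-5 rotates the parity block across the disks; this sketch
# only shows why one lost block can be reconstructed.

def xor_blocks(a: bytes, b: bytes) -> bytes:
    """XOR two equally sized blocks byte by byte."""
    return bytes(x ^ y for x, y in zip(a, b))

data1 = b"hello world, blk"          # block on disk 1
data2 = b"second data blok"          # block on disk 2
parity = xor_blocks(data1, data2)    # parity block on disk 3

# Disk 2 dies: its block is recovered from the survivors,
# because data1 XOR (data1 XOR data2) == data2.
rebuilt = xor_blocks(data1, parity)
assert rebuilt == data2
```

The same arithmetic works for any one missing block in the stripe, which is exactly why RAID-5 survives one disk but not two.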
I know some people "frown" upon RAID-5, but a RAID-5 is a really great way to utilize both speed and space. In any case no kind of RAID setup is a replacement for proper backup. If your data is important to you, you should always back it up.
In the picture below I have set up two identical computers. During the main testing I always used the same machine and hardware, but for extensive and repeated testing I put the second machine to work too.
Both computers are equipped with an Intel Core 2 Duo E8400 3.00 GHz CPU with 8 GB of memory and an Intel Pro 1000 PT PCIe x1 Gigabit NIC. All the hard drives are some really amazing but old 1 TB Seagate Barracuda ES.2 drives from 2010 (with the latest firmware) that have all been through quite a lot of "beating" over the years. I believe I have about ten of these drives and (if I remember correctly) only one has failed, about a year ago. The rest are still going strong.
For ZFS I ran Debian Linux "Stretch" with kernel version 4.19.28 and zfs-dkms version 0.7.12-1 from backports. For Btrfs I ran Arch Linux with kernel version 5.0.9 and Btrfs version 4.20.2. For both ZFS and Btrfs I used Samba. In my experience Samba performs better than NFS even between Linux-only machines and even though NFS uses less resources, I prefer Samba for various reasons.
At some point during my testing with Btrfs I discovered dm-integrity. I therefore decided to set up a RAID-5 with mdadm+dm-integrity on the Arch Linux installation and repeat the tests.
During the process I sometimes jumped back and forth between different tests on the different systems. For example, I first tested ZFS, then repeated the tests with Btrfs, then began testing mdadm+dm-integrity, then went back and performed some more tests on both ZFS and Btrfs, etc. The article is therefore put together from the various tests, and the date and time in the different terminal outputs don't always match up. I also sometimes changed the disks in my setup, so the disk IDs occasionally change. Please ignore that.
Myths and misunderstandings
One thing that really bothers me is how much false information exists on the Internet regarding both ZFS and Btrfs.
Some misinformation has been spread due to inexperience, wrong expectations, and/or misunderstandings about the usage of these systems.
Let's get some of the myths and misunderstandings out of the way:
- Myth: ZFS requires tons of memory!
This is one of the biggest misunderstandings about ZFS. The only situation in which ZFS requires lots of memory is if you specifically use de-duplication. I have run ZFS successfully using FreeBSD 12 on a Raspberry Pi 3 with two 1 TB USB disks attached to a single USB 3 hub. ZFS never used more than half the available memory during any kind of procedure, and you can change the settings so it runs with even much less than that.
- Myth: Red Hat has removed Btrfs because they consider it useless!
No, that is not why Red Hat removed Btrfs. A former Red Hat developer explains the situation on Hacker News.
- Myth: ZFS and Btrfs require ECC memory!
ZFS or Btrfs without ECC memory is no worse than any other filesystem without ECC memory. Using ECC memory is recommended in situations where the strongest data integrity guarantees are required. Random bit flips caused by cosmic rays or by faulty memory can go undetected without ECC memory, and any filesystem will then write the damaged data from memory to disk and be unable to automatically detect the corruption. Also note that ECC memory is often not supported by consumer grade hardware, and it is more expensive. In any case, you can run ZFS and Btrfs without ECC memory; it's not a requirement.
- Myth: Restoring a RAID-5 puts more stress on the drives!
Drives are not stressed! It's their job to read and write data! You are using your drives, not stressing them. It takes longer to restore a RAID-5 because the parity data needs to be calculated by the CPU, which is slower than simply copying data between disks in a mirror (RAID-1), but there is no stress involved.
- Myth: Using USB disk devices with ZFS or Btrfs is okay!
Sometimes you can get away with it without any problems whatsoever, but many USB controllers and USB storage devices are really bad. If things break, you cannot blame the filesystem. On Btrfs, a Parent transid verify failed error is often the result of a failed internal consistency check of the filesystem's metadata due to a bad USB storage device. Other issues such as automatic and sudden un-mounting, wrong file sizes, data corruption, sudden shutdowns, and several other problems are often caused by a bad USB storage device and/or USB power issues.
- Myth: Btrfs still has the write hole issue and is completely useless!
The myth part of this is that Btrfs is completely useless, not the write hole issue itself. As of writing, Btrfs still has some issues, but it is definitely not useless, and you can even run RAID5/6 if you take some specific precautions. Check the RAID5/6 information. The "write hole" problem with Btrfs only potentially exists if you experience a power loss (an unclean shutdown) and then have a disk fail immediately thereafter (or possibly at the same time) - without running a scrub in between. These two distinct failures combined break the Btrfs RAID-5 redundancy. However, I was not able to reproduce the problem in any of my many tests with Btrfs. Update 2020-01-23: People have been emailing me with examples of the write hole problem persisting, where they have lost data, even with the Btrfs version in the 5.x kernel.
- Myth: Btrfs is abandoned!
Btrfs is used in production worldwide. Btrfs is deployed by Facebook on millions of servers with significant efficiency gains. It is also used by many other companies and projects, and Btrfs keeps getting better and better.
- Myth: mdadm+XYZ can replace ZFS or Btrfs!
No. They don't even compare.
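The bit-flip scenario from the ECC myth above is easy to demonstrate: a checksumming filesystem stores a checksum per block and compares it on every read. A minimal sketch, using SHA-256 merely as a stand-in for the filesystem's internal checksum:

```python
import hashlib

# A block as written to disk, together with its stored checksum.
block = bytearray(b"some file contents stored on disk")
stored_checksum = hashlib.sha256(block).digest()

# A single bit flips on disk (or in non-ECC memory before the write):
block[5] ^= 0x01

# On read, the checksum comparison exposes the silent corruption;
# a plain filesystem would happily return the damaged data instead.
corrupted = hashlib.sha256(block).digest() != stored_checksum
assert corrupted
```

Note the caveat from the myth above: if the flip happens in memory *before* the checksum is computed, even a checksumming filesystem will faithfully store the damaged data - which is the one case ECC memory protects against.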
Some advice
Most data loss reported on the mailing lists of ZFS, Btrfs, and mdadm, is down to user error while attempting to recover a failed array. Never use a trial-and-error approach when something goes wrong with your filesystem or backup solution!
Very often a really bad situation is caused by a trial-and-error approach to a problem. With Btrfs many people immediately use the btrfs check --repair
command when they experience an issue, but this is actually the very last command you want to run.
Understand what you can expect from the filesystem you're using, how it works, and how each system implements a specific functionality. Don't blame the filesystem when it doesn't fulfill mistaken expectations.
ZFS RAID-Z
Let's begin the testing with ZFS.
The three disks are listed "by-id" and I'll create the ZFS pool using those IDs, as they also contain the serial number, which makes it very easy to identify each drive.
$ ls -gG /dev/disk/by-id/ ata-ST31000340NS_9QJ089LF -> ../../sdd ata-ST31000340NS_9QJ0EQ1V -> ../../sdb ata-ST31000340NS_9QJ0F2YQ -> ../../sdc
With a RAID-Z (RAID-5) I can stand to lose one drive and the pool will still function, however I need to "resilver" the pool as soon as possible with a replacement drive.
Resilvering is the same concept as rebuilding a RAID array. With most other RAID implementations there is no distinction between which blocks are in use and which aren't. A typical rebuild therefore runs from the beginning of the disk to the end - this is how mdadm works, and it is extremely slow. But because ZFS knows the structure of the RAID system and the metadata, it rebuilds only the blocks in use. The ZFS developers therefore coined the term "resilvering" rather than "rebuilding".
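Back-of-the-envelope numbers show why this matters. The figures below are illustrative assumptions, not measurements, but they match the scale of the roughly 3 minute resilver seen later in this article:

```python
# Rough comparison: whole-disk rebuild (mdadm style) vs used-blocks-only
# resilver (ZFS style). All numbers are illustrative assumptions.

disk_size_gb = 1000   # a 1 TB member drive
used_gb = 36          # data actually stored on the pool
rate_mb_s = 200       # assumed sequential rebuild rate

full_rebuild_min = disk_size_gb * 1024 / rate_mb_s / 60   # entire disk
resilver_min = used_gb * 1024 / rate_mb_s / 60            # used blocks only

print(round(full_rebuild_min))  # ~85 minutes for the whole disk
print(round(resilver_min))      # ~3 minutes for just the used blocks
```

The gap only grows with bigger, mostly empty drives; with a nearly full pool the two approaches converge.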
I'm going to create a pool using the -f
option because ZFS will detect that the attached drives used to belong to an old pool and will not allow for it to be used in a new pool unless forced to do so (I have used the drives in a previous setup).
# zpool create -f -O xattr=sa -O dnodesize=auto -O atime=off -o ashift=12 pool1 raidz ata-ST31000340NS_9QJ0F2YQ ata-ST31000340NS_9QJ0EQ1V ata-ST31000340NS_9QJ089LF
I'm then going to create a ZFS dataset on the pool with lz4 compression enabled.
# zfs create -o compress=lz4 pool1/pub # zfs list NAME USED AVAIL REFER MOUNTPOINT pool1 575K 1.75T 128K /pool1 pool1/pub 128K 1.75T 128K /pool1/pub
I have then exported the "pub" directory using Samba and will begin by copying some files over from a client computer using rsync
.
$ rsync -a --progress --stats tmp/ mnt/testbox/pub/tmp/ 1.pdf 18,576,345 100% 196.49MB/s 0:00:00 (xfr#1, to-chk=6/8) 2.pdf 30,255,102 100% 70.89MB/s 0:00:00 (xfr#2, to-chk=5/8) 3.pdf 22,016,195 100% 23.28MB/s 0:00:00 (xfr#3, to-chk=4/8) bar.mkv 35,456,180,485 100% 112.92MB/s 0:04:59 (xfr#4, to-chk=3/8) boo.iso 625,338,368 100% 21.64MB/s 0:00:27 (xfr#5, to-chk=2/8) foo.mkv 1,548,841,922 100% 135.76MB/s 0:00:10 (xfr#6, to-chk=1/8) moo.iso 415,633,408 100% 25.86MB/s 0:00:15 (xfr#7, to-chk=0/8) Number of files: 8 (reg: 7, dir: 1) Number of created files: 8 (reg: 7, dir: 1) Number of deleted files: 0 Number of regular files transferred: 7 Total file size: 38,116,841,825 bytes Total transferred file size: 38,116,841,825 bytes Literal data: 38,116,841,825 bytes Matched data: 0 bytes File list size: 0 File list generation time: 0.001 seconds File list transfer time: 0.000 seconds Total bytes sent: 38,126,148,150 Total bytes received: 202 sent 38,126,148,150 bytes received 202 bytes 106,945,717.68 bytes/sec total size is 38,116,841,825 speedup is 1.0
Now the ZFS pool has some data:
# zfs list NAME USED AVAIL REFER MOUNTPOINT pool1 35.5G 1.72T 128K /pool1 pool1/pub 35.5G 1.72T 35.5G /pool1/pub
ZFS - Power outage
I'll then add yet another file using rsync
and then pull the power cord on the ZFS machine halfway through the transfer.
I have then aborted the rest of the file transfer on the client and turned the ZFS machine back on.
$ rsync -a --progress --stats tmp/ mnt/testbox/pub/tmp/ sending incremental file list zoo.mkv 5,918,261,248 54% 64.88kB/s 21:11:16 ^C
Because ZFS uses transactional semantics, the partially transferred file is lost, but nothing has happened to the files already on the system, there is no damage to the filesystem, and no filesystem check needs to be run.
Let's take a look at the ZFS documentation from Oracle regarding the Transactional Semantics:
ZFS is a transactional file system, which means that the file system state is always consistent on disk. Traditional file systems overwrite data in place, which means that if the system loses power, for example, between the time a data block is allocated and when it is linked into a directory, the file system will be left in an inconsistent state. Historically, this problem was solved through the use of the fsck command. This command was responsible for reviewing and verifying the file system state, and attempting to repair any inconsistencies during the process. This problem of inconsistent file systems caused great pain to administrators, and the fsck command was never guaranteed to fix all possible problems. More recently, file systems have introduced the concept of journaling. The journaling process records actions in a separate journal, which can then be replayed safely if a system crash occurs. This process introduces unnecessary overhead because the data needs to be written twice, often resulting in a new set of problems, such as when the journal cannot be replayed properly.
With a transactional file system, data is managed using copy on write semantics. Data is never overwritten, and any sequence of operations is either entirely committed or entirely ignored. Thus, the file system can never be corrupted through accidental loss of power or a system crash. Although the most recently written pieces of data might be lost, the file system itself will always be consistent. In addition, synchronous data (written using the O_DSYNC flag) is always guaranteed to be written before returning, so it is never lost.
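The copy-on-write idea from the quote can be sketched as follows. This is a toy model (a dict as the "disk" and a root pointer as the commit point), not how ZFS actually lays out data:

```python
# Toy copy-on-write store: data is never overwritten in place, and the
# commit is a single atomic switch of the root pointer.

store = {0: b"old contents"}   # block id -> block data ("the disk")
root = 0                       # points at the current consistent version

def cow_write(new_data: bytes) -> int:
    """Write new data to a fresh block; the old block stays untouched."""
    new_id = max(store) + 1
    store[new_id] = new_data
    return new_id

new_id = cow_write(b"new contents")
# A crash at this point is harmless: root still references the old,
# fully consistent version, so no fsck-style repair is needed.
root = new_id                  # the atomic commit
assert store[root] == b"new contents" and store[0] == b"old contents"
```

This is exactly the property exercised by the power-cord test above: the half-written file simply never got committed, so the pool stayed consistent.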
This is confirmed by a look at the status of the pool:
# zpool status pool: pool1 state: ONLINE scan: none requested config: NAME STATE READ WRITE CKSUM pool1 ONLINE 0 0 0 raidz1-0 ONLINE 0 0 0 ata-ST31000340NS_9QJ0F2YQ ONLINE 0 0 0 ata-ST31000340NS_9QJ0EQ1V ONLINE 0 0 0 ata-ST31000340NS_9QJ089LF ONLINE 0 0 0 errors: No known data errors # zfs list NAME USED AVAIL REFER MOUNTPOINT pool1 35.5G 1.72T 128K /pool1 pool1/pub 35.5G 1.72T 35.5G /pool1/pub
And from the client's point of view:
$ ls -gG mnt/testbox/pub/tmp total 37194300 -rwxr-xr-x 1 18576345 Apr 21 09:08 1.pdf -rwxr-xr-x 1 30255102 Apr 21 09:08 2.pdf -rwxr-xr-x 1 22016195 Apr 21 09:08 3.pdf -rwxr-xr-x 1 35456180485 Apr 21 07:58 bar.mkv -rwxr-xr-x 1 625338368 Mar 5 2018 boo.iso -rwxr-xr-x 1 1548841922 Apr 15 23:50 foo.mkv -rwxr-xr-x 1 415633408 Mar 5 2018 moo.iso
ZFS - Drive failure
Now I want to simulate a simple drive failure. I'm going to remove one of the drives from the ZFS machine, then replace it with another drive, and then resilver the ZFS pool.
I have removed the drive:
# zpool status pool: pool1 state: DEGRADED status: One or more devices could not be used because the label is missing or invalid. Sufficient replicas exist for the pool to continue functioning in a degraded state. action: Replace the device using 'zpool replace'. see: http://zfsonlinux.org/msg/ZFS-8000-4J scan: none requested config: NAME STATE READ WRITE CKSUM pool1 DEGRADED 0 0 0 raidz1-0 DEGRADED 0 0 0 ata-ST31000340NS_9QJ0F2YQ ONLINE 0 0 0 1803500998269517419 UNAVAIL 0 0 0 was /dev/disk/by-id/ata-ST31000340NS_9QJ0EQ1V-part1 ata-ST31000340NS_9QJ089LF ONLINE 0 0 0 errors: No known data errors # zfs list NAME USED AVAIL REFER MOUNTPOINT pool1 35.5G 1.72T 128K /pool1 pool1/pub 35.5G 1.72T 35.5G /pool1/pub
Even though the pool is in a degraded state, I can still mount the pool on the client and use the files.
$ mount mnt/testbox/pub $ ls -gG mnt/testbox/pub/tmp total 37194300 -rwxr-xr-x 1 18576345 Apr 21 09:08 1.pdf -rwxr-xr-x 1 30255102 Apr 21 09:08 2.pdf -rwxr-xr-x 1 22016195 Apr 21 09:08 3.pdf -rwxr-xr-x 1 35456180485 Apr 21 07:58 bar.mkv -rwxr-xr-x 1 625338368 Mar 5 2018 boo.iso -rwxr-xr-x 1 1548841922 Apr 15 23:50 foo.mkv -rwxr-xr-x 1 415633408 Mar 5 2018 moo.iso
I can also write to the pool.
$ echo Hello > mnt/testbox/pub/tmp/hello.txt $ ls -gG mnt/testbox/pub/tmp/ total 37194304 -rwxr-xr-x 1 18576345 Apr 21 09:08 1.pdf -rwxr-xr-x 1 30255102 Apr 21 09:08 2.pdf -rwxr-xr-x 1 22016195 Apr 21 09:08 3.pdf -rwxr-xr-x 1 35456180485 Apr 21 07:58 bar.mkv -rwxr-xr-x 1 625338368 Mar 5 2018 boo.iso -rwxr-xr-x 1 1548841922 Apr 15 23:50 foo.mkv -rwxr-xr-x 1 6 Apr 24 23:11 hello.txt -rwxr-xr-x 1 415633408 Mar 5 2018 moo.iso
Now I need to identify the new drive:
$ ls -l /dev/disk/by-id/ ata-ST31000340NS_9QJ0DVN2 -> ../../sdb
Then I need to replace the old drive with the new. The procedure, since the old drive is completely gone, is not to detach and then replace, but simply to replace with zpool replace pool old_device new_device
.
# zpool replace pool1 ata-ST31000340NS_9QJ0EQ1V ata-ST31000340NS_9QJ0DVN2
ZFS will immediately and automatically begin the resilvering of the pool:
# zpool status pool: pool1 state: DEGRADED status: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state. action: Wait for the resilver to complete. scan: resilver in progress since Wed Apr 24 23:19:33 2019 10.5G scanned out of 53.3G at 228M/s, 0h3m to go 3.49G resilvered, 19.68% done config: NAME STATE READ WRITE CKSUM pool1 DEGRADED 0 0 0 raidz1-0 DEGRADED 0 0 0 ata-ST31000340NS_9QJ0F2YQ ONLINE 0 0 0 replacing-1 DEGRADED 0 0 0 1803500998269517419 UNAVAIL 0 0 0 was /dev/disk/by-id/ata-ST31000340NS_9QJ0EQ1V-part1 ata-ST31000340NS_9QJ0DVN2 ONLINE 0 0 0 (resilvering) ata-ST31000340NS_9QJ089LF ONLINE 0 0 0 errors: No known data errors
After about 3 minutes the pool is back up and ready for usage:
# zpool status pool: pool1 state: ONLINE scan: resilvered 17.8G in 0h3m with 0 errors on Wed Apr 24 23:22:56 2019 config: NAME STATE READ WRITE CKSUM pool1 ONLINE 0 0 0 raidz1-0 ONLINE 0 0 0 ata-ST31000340NS_9QJ0F2YQ ONLINE 0 0 0 ata-ST31000340NS_9QJ0DVN2 ONLINE 0 0 0 ata-ST31000340NS_9QJ089LF ONLINE 0 0 0 errors: No known data errors
And from the client:
$ ls -gG mnt/testbox/pub/tmp total 37194304 -rwxr-xr-x 1 18576345 Apr 21 09:08 1.pdf -rwxr-xr-x 1 30255102 Apr 21 09:08 2.pdf -rwxr-xr-x 1 22016195 Apr 21 09:08 3.pdf -rwxr-xr-x 1 35456180485 Apr 21 07:58 bar.mkv -rwxr-xr-x 1 625338368 Mar 5 2018 boo.iso -rwxr-xr-x 1 1548841922 Apr 15 23:50 foo.mkv -rwxr-xr-x 1 6 Apr 24 23:11 hello.txt -rwxr-xr-x 1 415633408 Mar 5 2018 moo.iso
Just to make sure all data has been resilvered without any errors during writing I'll perform a scrub and validate that everything is alright:
# zpool scrub pool1
And about 3 minutes later the scrub has finished:
# zpool status pool: pool1 state: ONLINE scan: scrub repaired 0B in 0h3m with 0 errors on Thu Apr 24 23:56:01 2019 config: NAME STATE READ WRITE CKSUM pool1 ONLINE 0 0 0 raidz1-0 ONLINE 0 0 0 ata-ST31000340NS_9QJ0F2YQ ONLINE 0 0 0 ata-ST31000340NS_9QJ0DVN2 ONLINE 0 0 0 ata-ST31000340NS_9QJ089LF ONLINE 0 0 0 errors: No known data errors
Since ZFS has only restored the used data blocks, not the entire disk, the procedure was very fast, as was the scrubbing.
ZFS - Drive failure during file transfer
Now I want to remove a disk in the middle of an active file transfer in order to simulate a total, but not permanent, failure of a disk. This might happen if the disk power cord has managed to wiggle itself loose, or if the disk sits in a slot and hasn't been pushed all the way in, etc.
$ rsync -a --progress --stats tmp/ mnt/testbox/pub/tmp/ sending incremental file list zoo.mkv 10,867,033,488 100% 127.95MB/s 0:01:20 (xfr#1, to-chk=0/9) Number of files: 9 (reg: 8, dir: 1) Number of created files: 1 (reg: 1) Number of deleted files: 0 Number of regular files transferred: 1 Total file size: 48,983,875,313 bytes Total transferred file size: 10,867,033,488 bytes Literal data: 10,867,033,488 bytes Matched data: 0 bytes File list size: 0 File list generation time: 0.001 seconds File list transfer time: 0.000 seconds Total bytes sent: 10,869,686,827 Total bytes received: 38 sent 10,869,686,827 bytes received 38 bytes 102,062,787.46 bytes/sec total size is 48,983,875,313 speedup is 4.5
I removed the drive by disconnecting its individual power cord. The ZFS machine reacted by halting the file transfer for about a second, then resumed at full speed, and the client only experienced a momentary drop in the transfer speed.
The file transfer was then completed without any problems on the client side.
On the ZFS machine the pool has now changed the state to DEGRADED:
# zpool status pool: pool1 state: DEGRADED status: One or more devices could not be used because the label is missing or invalid. Sufficient replicas exist for the pool to continue functioning in a degraded state. action: Replace the device using 'zpool replace'. see: http://zfsonlinux.org/msg/ZFS-8000-4J scan: none requested config: NAME STATE READ WRITE CKSUM pool1 DEGRADED 0 0 0 raidz1-0 DEGRADED 0 0 0 ata-ST31000340NS_9QJ0ES1V ONLINE 0 0 0 ata-ST31000340NS_9QJ0ET8D ONLINE 0 0 0 ata-ST31000340NS_9QJ0EZZC UNAVAIL 0 0 0 errors: No known data error
I powered down the machine in order to safely reattach the drive and then rebooted.
ZFS has detected the error:
# zpool status pool: pool1 state: ONLINE status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://zfsonlinux.org/msg/ZFS-8000-9P scan: none requested config: NAME STATE READ WRITE CKSUM pool1 ONLINE 0 0 0 raidz1-0 ONLINE 0 0 0 ata-ST31000340NS_9QJ0ES1V ONLINE 0 0 0 ata-ST31000340NS_9QJ0ET8D ONLINE 0 0 0 ata-ST31000340NS_9QJ0EZZC ONLINE 0 0 11 errors: No known data errors
This situation simulates a physical drive failure while the ZFS pool is under active use, and it is probably one of the most common situations in real life.
In order to handle the problem correctly I would normally need to investigate the situation.
- Has the drive physically failed and therefore needs a replacement?
- Or is it perhaps a wire that has managed to wiggle itself loose?
- Or is it perhaps the wire itself that is broken?
- Or has the disk connector (both on the disk itself and on the motherboard) experienced any physical corrosion? (This actually happens).
It's important to remember that whether a disk is good or bad is not a simple yes or no question. A disk can be "mostly" good, with a few physical sectors that give errors. A disk can be bad for a few seconds, hours, or days, and then go back to working fine again for years.
Due to firmware issues, a disk may handle most operations fine while certain operations don't work well. Disk problems come in shades of gray; they are multi-dimensional and time-dependent!
Now, since this is just a simulation I know what to do, but in a real life situation you need to investigate the above questions as any of the above issues might be the cause of the problem.
If there aren't any physical problems with the setup, you might be able to get some useful information from S.M.A.R.T.
In my situation I have determined that the problem was caused by a system administrator who managed to pull the power cord from the disk "by mistake" so I don't need to replace the drive :)
The correct approach is therefore to do a scrub after the drive has been reattached. During a scrub ZFS will detect any checksum errors and will restore the data using the parity data.
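What a scrub does can be sketched by combining the two ingredients: per-block checksums to find bad blocks, and parity to rewrite them. A toy model, not ZFS internals (SHA-256 stands in for ZFS's checksum):

```python
import hashlib

def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# One stripe: two data blocks, their XOR parity, and stored checksums.
d1, d2 = b"data block one!!", b"data block two!!"
parity = xor_blocks(d1, d2)
sums = {"d1": hashlib.sha256(d1).digest(), "d2": hashlib.sha256(d2).digest()}

# Something corrupts d1 on disk while the drive was disconnected:
d1 = b"data block #@&%!"

# Scrub: verify every block against its checksum and rebuild any
# mismatching block from the remaining block plus parity.
if hashlib.sha256(d1).digest() != sums["d1"]:
    d1 = xor_blocks(d2, parity)

assert hashlib.sha256(d1).digest() == sums["d1"]   # repaired
```

The checksum is what distinguishes this from a plain RAID rebuild: the scrub knows *which* copy is wrong, instead of just recomputing parity over possibly bad data.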
# zpool scrub pool1 # zpool status pool: pool1 state: DEGRADED status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://zfsonlinux.org/msg/ZFS-8000-9P scan: scrub repaired 4.19G in 0h6m with 0 errors on Fri Apr 26 01:32:23 2019 config: NAME STATE READ WRITE CKSUM pool1 DEGRADED 0 0 0 raidz1-0 DEGRADED 0 0 0 ata-ST31000340NS_9QJ0ES1V ONLINE 0 0 0 ata-ST31000340NS_9QJ0ET8D ONLINE 0 0 0 ata-ST31000340NS_9QJ0EZZC DEGRADED 0 0 67.2K too many errors errors: No known data error
After the scrubbing is done ZFS tells us that it has repaired 4.19GB of data with 0 errors.
Even though ZFS has managed to repair everything without any errors, it still keeps the pool in a degraded state because it is up to the system administrator to decide what needs to be done. This is important because even though ZFS has managed to rescue all the data, we might still be dealing with an unhealthy device.
Had there been any unrecoverable errors during the scrubbing we would be facing a disk that is too damaged for ZFS to continue working with it.
Can we clear the log and bring the pool back into the ONLINE and healthy state? Or do we need to replace the drive anyway? Perhaps S.M.A.R.T has warned us that the drive is currently working but experiencing occasional issues and soon needs to be fully replaced.
In this case we know that the drive is working fine so I'll just clear the log:
# zpool clear pool1 # zpool status pool: pool1 state: ONLINE scan: scrub repaired 0B in 0h3m with 0 errors on Fri Apr 26 02:09:44 2019 config: NAME STATE READ WRITE CKSUM pool1 ONLINE 0 0 0 raidz1-0 ONLINE 0 0 0 ata-ST31000340NS_9QJ0ES1V ONLINE 0 0 0 ata-ST31000340NS_9QJ0ET8D ONLINE 0 0 0 ata-ST31000340NS_9QJ0EZZC ONLINE 0 0 0 errors: No known data error
As a side note I can mention that I have worked with tons of hardware over the past 25+ years, and I have seen several situations in which S.M.A.R.T has reported problems with drives that then kept going for many years after being reported as both old and worn out. Of course you cannot ignore such reports; depending on the situation you might need to replace the drive, but it can often still be used in a less important capacity.
ZFS - Data corruption during file transfer
Now I want to simulate data corruption in the middle of a file transfer from the client. Not a drive failure, but some corruption of the data located on the pool.
I have removed the "zoo.mkv" file and while the rsync
command is running again I'll do a couple of dd
commands on the ZFS machine on one of the drives.
# dd if=/dev/urandom of=/dev/disk/by-id/ata-ST31000340NS_9QJ0ES1V seek=100000 count=1000 bs=1k
While the transfer is still running, I'm checking the pool status:
# zpool status pool: pool1 state: ONLINE scan: none requested config: NAME STATE READ WRITE CKSUM pool1 ONLINE 0 0 0 raidz1-0 ONLINE 0 0 1 ata-ST31000340NS_9QJ0ES1V ONLINE 0 0 0 ata-ST31000340NS_9QJ0ET8D ONLINE 0 0 0 ata-ST31000340NS_9QJ0EZZC ONLINE 0 0 0 errors: No known data errors
ZFS shows a checksum issue which has been fixed. dmesg and the logs currently don't provide any further information, but we can take a look at the zpool events -v
command if we want further information:
# zpool events -v Apr 26 2019 23:05:59.990726744 ereport.fs.zfs.checksum class = "ereport.fs.zfs.checksum" ena = 0x18549e4f2ec00401 detector = (embedded nvlist) version = 0x0 scheme = "zfs" pool = 0x4cdea36f1d7afa7c vdev = 0x772f5157f66ae182 (end detector) pool = "pool1" pool_guid = 0x4cdea36f1d7afa7c pool_state = 0x0 pool_context = 0x0 pool_failmode = "wait" vdev_guid = 0x772f5157f66ae182 vdev_type = "disk" vdev_path = "/dev/disk/by-id/ata-ST31000340NS_9QJ0ES1V-part1" vdev_ashift = 0xc vdev_complete_ts = 0x18549e3862a vdev_delta_ts = 0x19c5648 vdev_read_errors = 0x0 vdev_write_errors = 0x0 vdev_cksum_errors = 0x0 parent_guid = 0x3a01d1f81d93aaf8 parent_type = "raidz" vdev_spare_paths = vdev_spare_guids = zio_err = 0x34 zio_flags = 0x100080 zio_stage = 0x100000 zio_pipeline = 0xf80000 zio_delay = 0x0 zio_timestamp = 0x0 zio_delta = 0x0 zio_offset = 0x5cb6000 zio_size = 0x6000 zio_objset = 0x48 zio_object = 0x82 zio_level = 0x1 zio_blkid = 0x0 bad_ranges = 0x0 0x6000 bad_ranges_min_gap = 0x8 bad_range_sets = 0xcaa5 bad_range_clears = 0xb597 bad_set_histogram = 0x32c 0x32b 0x334 0x312 0x334 0x31c 0x306 0x31f 0x300 0x340 0x303 0x30d 0x330 0x318 0x324 0x2f0 0x304 0x32b 0x314 0x33c 0x339 0x2fd 0x33c 0x347 0x33c 0x379 0x33f 0x324 0x327 0x351 0x310 0x313 0x31f 0x31c 0x31e 0x334 0x354 0x32e 0x33e 0x312 0x32d 0x369 0x340 0x337 0x32a 0x330 0x32c 0x33a 0x319 0x328 0x30a 0x332 0x32a 0x320 0x333 0x333 0x34b 0x316 0x347 0x30c 0x34c 0x35a 0x34a 0x2ff bad_cleared_histogram = 0x2cc 0x2c3 0x2e2 0x2ca 0x29c 0x2fa 0x2f8 0x2d0 0x2e6 0x2cd 0x2d5 0x2c3 0x2bf 0x2d7 0x2d7 0x2fa 0x2c8 0x2d4 0x2d1 0x303 0x2ef 0x2fa 0x2f4 0x2c1 0x2a3 0x2b7 0x2b4 0x2e9 0x2e6 0x2c9 0x2d9 0x2eb 0x2c1 0x2b9 0x2e4 0x2d7 0x2c0 0x2ff 0x2c7 0x2dc 0x2e8 0x2bc 0x2c7 0x2d8 0x2ed 0x2db 0x2db 0x318 0x2e8 0x2c8 0x2db 0x2da 0x2de 0x2f7 0x2d0 0x2e6 0x2ae 0x2fb 0x2ca 0x2d5 0x2a9 0x2d2 0x2e2 0x2aa time = 0x5cc372b7 0x3b0d4a58 eid = 0x1f
The ZFS event output has never been fully documented publicly, but we do know from the above output that some bad bit ranges were detected and cleared, and that everything is back in order.
ZFS - The dd mistake
Have you ever made the mistake of running the rm -rf
command as the root user on the / path of your disk? Or even worse what about the dd
command?
I want to extend the above test and see what happens if I mistakenly type the dd
command and let it run for a while during a file transfer from the client.
I have deleted all the files, restarted rsync
, and I am now letting the dd
run:
# dd if=/dev/urandom of=/dev/disk/by-id/ata-ST31000340NS_9QJ0ES1V bs=1k ^C348001+0 records in 348001+0 records out 356353024 bytes (356 MB, 340 MiB) copied, 47.1212 s, 7.6 MB/s
This should make a big mess of things.
Nothing noticeable has happened on the client:
$ rsync -a --progress --stats tmp/ mnt/testbox/pub/tmp/ sending incremental file list ./ 1.pdf 18,576,345 100% 178.63MB/s 0:00:00 (xfr#1, to-chk=7/9) 2.pdf 30,255,102 100% 76.33MB/s 0:00:00 (xfr#2, to-chk=6/9) 3.pdf 22,016,195 100% 28.68MB/s 0:00:00 (xfr#3, to-chk=5/9) bar.mkv 14,681,931,776 41% 112.62MB/s 0:03:00
ZFS has detected the errors:
# zpool status pool: pool1 state: ONLINE status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://zfsonlinux.org/msg/ZFS-8000-9P scan: none requested config: NAME STATE READ WRITE CKSUM pool1 ONLINE 0 0 0 raidz1-0 ONLINE 0 0 0 ata-ST31000340NS_9QJ0ES1V ONLINE 0 0 1 ata-ST31000340NS_9QJ0ET8D ONLINE 0 0 0 ata-ST31000340NS_9QJ0EZZC ONLINE 0 0 0 errors: No known data error
This shows the remarkable resilience of ZFS. Even though I just started running dd
on one of the drives, the filesystem keeps working and clients can still read from and write to the pool.
All I need to do is to perform a scrub to fix the problems:
# zpool scrub pool1 # zpool status pool: pool1 state: ONLINE status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://zfsonlinux.org/msg/ZFS-8000-9P scan: scrub in progress since Tue Apr 30 01:35:56 2019 2.24G scanned out of 68.5G at 209M/s, 0h5m to go 28K repaired, 3.28% done config: NAME STATE READ WRITE CKSUM pool1 ONLINE 0 0 0 raidz1-0 ONLINE 0 0 0 ata-ST31000340NS_9QJ0ES1V ONLINE 0 0 9 (repairing) ata-ST31000340NS_9QJ0ET8D ONLINE 0 0 0 ata-ST31000340NS_9QJ0EZZC ONLINE 0 0 0 errors: No known data errors
And the result:
# zpool status pool: pool1 state: DEGRADED status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://zfsonlinux.org/msg/ZFS-8000-9P scan: scrub repaired 60K in 0h3m with 0 errors on Tue Apr 30 01:39:50 2019 config: NAME STATE READ WRITE CKSUM pool1 DEGRADED 0 0 0 raidz1-0 DEGRADED 0 0 0 ata-ST31000340NS_9QJ0ES1V DEGRADED 0 0 17 too many errors ata-ST31000340NS_9QJ0ET8D ONLINE 0 0 0 ata-ST31000340NS_9QJ0EZZC ONLINE 0 0 0 errors: No known data errors
Again we have to investigate in order to determine if the disk that has suffered checksum errors needs to be replaced or if we can simply clear the log.
ZFS has managed to repair everything with 0 errors and all the disks are back up and working fine, I'll clear the log:
# zpool clear pool1 # zpool status pool: pool1 state: ONLINE scan: scrub repaired 0B in 0h3m with 0 errors on Tue Apr 30 01:59:34 2019 config: NAME STATE READ WRITE CKSUM pool1 ONLINE 0 0 0 raidz1-0 ONLINE 0 0 0 ata-ST31000340NS_9QJ0ES1V ONLINE 0 0 0 ata-ST31000340NS_9QJ0ET8D ONLINE 0 0 0 ata-ST31000340NS_9QJ0EZZC ONLINE 0 0 0 errors: No known data errors
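Rather than eyeballing the CKSUM column every time, the status output can be checked mechanically. This is only a sketch of the idea, assuming the standard NAME/STATE/READ/WRITE/CKSUM column layout shown above; the script name is made up:

```shell
#!/bin/sh
# cksum-check.sh (hypothetical name): read `zpool status` output on
# stdin and print every device line with a non-zero CKSUM count.
# Assumes the NAME STATE READ WRITE CKSUM column layout shown above.
awk '$2 == "ONLINE" || $2 == "DEGRADED" {
    if ($5 != "" && $5 != "0")
        print $1 " has " $5 " checksum errors"
}'
```

It can then be used as `zpool status pool1 | sh cksum-check.sh`, for example from a cron job.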
ZFS - A second drive failure during a replacement
The most dreaded situation in any RAID-5 setup is a second drive failing during the restoration of the pool.
Let's see what's going to happen.
I have created a new pool with three disks and have transferred all the files from the client to the pool.
On the client:
# ls -gG /pool1/pub/tmp/ total 47803477 -rwxrw-r-- 1 18576345 Apr 21 09:08 1.pdf -rwxrw-r-- 1 30255102 Apr 21 09:08 2.pdf -rwxrw-r-- 1 22016195 Apr 21 09:08 3.pdf -rwxrw-r-- 1 35456180485 Apr 21 07:58 bar.mkv -rwxrw-r-- 1 625338368 Mar 5 2018 boo.iso -rwxrw-r-- 1 1548841922 Apr 15 23:50 foo.mkv -rwxrw-r-- 1 415633408 Mar 5 2018 moo.iso -rwxrw-r-- 1 10867033488 Apr 22 21:10 zoo.mkv
On the ZFS machine:
# zfs list NAME USED AVAIL REFER MOUNTPOINT pool1 45.6G 1.71T 128K /pool1 pool1/pub 45.6G 1.71T 45.6G /pool1/pub
I have then removed one of the drives from the pool to simulate the first break down:
# zpool status pool: pool1 state: DEGRADED status: One or more devices could not be used because the label is missing or invalid. Sufficient replicas exist for the pool to continue functioning in a degraded state. action: Replace the device using 'zpool replace'. see: http://zfsonlinux.org/msg/ZFS-8000-4J scan: none requested config: NAME STATE READ WRITE CKSUM pool1 DEGRADED 0 0 0 raidz1-0 DEGRADED 0 0 0 ata-ST31000340NS_9QJ089LF ONLINE 0 0 0 ata-ST31000340NS_9QJ0DVN2 ONLINE 0 0 0 1368416530025724573 UNAVAIL 0 0 0 was /dev/disk/by-id/ata-ST31000340NS_9QJ0ES1V-part1 errors: No known data errors
I am going to begin a replace procedure, and while the resilvering of the new drive is running I am going to disconnect one of the working drives.
# zpool replace -f pool1 ata-ST31000340NS_9QJ0ES1V ata-ST31000340NS_9QJ0EQ1V
Let's check the status:
# zpool status pool: pool1 state: DEGRADED status: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state. action: Wait for the resilver to complete. scan: resilver in progress since Fri May 3 23:23:37 2019 1.19G scanned out of 68.5G at 101M/s, 0h11m to go 404M resilvered, 1.74% done config: NAME STATE READ WRITE CKSUM pool1 DEGRADED 0 0 0 raidz1-0 DEGRADED 0 0 0 ata-ST31000340NS_9QJ089LF ONLINE 0 0 0 ata-ST31000340NS_9QJ0DVN2 ONLINE 0 0 0 replacing-2 DEGRADED 0 0 0 1368416530025724573 UNAVAIL 0 0 0 was /dev/disk/by-id/ata-ST31000340NS_9QJ0ES1V-part1 ata-ST31000340NS_9QJ0EQ1V ONLINE 0 0 0 (resilvering) errors: No known data errors
The resilvering is running and I am now disconnecting a second drive by pulling its power cord. ZFS has not had time to fully resilver the new drive.
# zpool status pool: pool1 state: DEGRADED status: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state. action: Wait for the resilver to complete. scan: resilver in progress since Fri May 3 23:23:37 2019 11.3G scanned out of 68.5G at 138M/s, 0h7m to go 2.38G resilvered, 16.53% done config: NAME STATE READ WRITE CKSUM pool1 DEGRADED 0 0 22.2K raidz1-0 DEGRADED 0 0 44.5K ata-ST31000340NS_9QJ089LF DEGRADED 0 0 0 too many errors ata-ST31000340NS_9QJ0DVN2 UNAVAIL 0 0 0 replacing-2 DEGRADED 0 0 0 1368416530025724573 UNAVAIL 0 0 0 was /dev/disk/by-id/ata-ST31000340NS_9QJ0ES1V-part1 ata-ST31000340NS_9QJ0EQ1V ONLINE 0 0 0 (resilvering) errors: 22768 data errors, use '-v' for a list
The resilvering has run its course, but it could not complete successfully. ZFS not only informs us about the problem, it also tells us which files are now unrecoverable.
# zpool status -v pool: pool1 state: DEGRADED status: One or more devices has experienced an error resulting in data corruption. Applications may be affected. action: Restore the file in question if possible. Otherwise restore the entire pool from backup. see: http://zfsonlinux.org/msg/ZFS-8000-8A scan: resilvered 2.38G in 0h7m with 235402 errors on Fri May 3 23:30:48 2019 config: NAME STATE READ WRITE CKSUM pool1 DEGRADED 0 0 230K raidz1-0 DEGRADED 0 0 461K ata-ST31000340NS_9QJ089LF DEGRADED 0 0 0 too many errors ata-ST31000340NS_9QJ0DVN2 UNAVAIL 0 0 0 replacing-2 DEGRADED 0 0 0 1368416530025724573 UNAVAIL 0 0 0 was /dev/disk/by-id/ata-ST31000340NS_9QJ0ES1V-part1 ata-ST31000340NS_9QJ0EQ1V ONLINE 0 0 0 errors: Permanent errors have been detected in the following files: /pool1/pub/tmp/bar.mkv /pool1/pub/tmp/foo.mkv /pool1/pub/tmp/moo.iso pool1/pub:<0x24> pool1/pub:<0x25> /pool1/pub/tmp/boo.iso /pool1/pub/tmp/zoo.mkv
In this situation trying to run any kind of repair process would not only be futile, but it would also be wrong. The filesystem itself isn't damaged and it doesn't require any kind of repairing.
The question is: what can we do to get as much data back from the broken pool as possible?
Let's run a scrub and see if by any chance we can salvage some files and restore as much of the pool as possible:
# zpool scrub pool1
Let's check:
# zpool status -v pool: pool1 state: DEGRADED status: One or more devices has experienced an error resulting in data corruption. Applications may be affected. action: Restore the file in question if possible. Otherwise restore the entire pool from backup. see: http://zfsonlinux.org/msg/ZFS-8000-8A scan: scrub repaired 0B in 0h1m with 1277 errors on Fri May 3 23:40:12 2019 config: NAME STATE READ WRITE CKSUM pool1 DEGRADED 0 0 235K raidz1-0 DEGRADED 0 0 479K ata-ST31000340NS_9QJ089LF DEGRADED 0 0 0 too many errors ata-ST31000340NS_9QJ0DVN2 UNAVAIL 0 0 0 replacing-2 DEGRADED 0 0 0 1368416530025724573 UNAVAIL 0 0 0 was /dev/disk/by-id/ata-ST31000340NS_9QJ0ES1V-part1 ata-ST31000340NS_9QJ0EQ1V ONLINE 0 0 0 errors: Permanent errors have been detected in the following files: /pool1/pub/tmp/bar.mkv /pool1/pub/tmp/foo.mkv /pool1/pub/tmp/moo.iso pool1/pub:<0x24> pool1/pub:<0x25> /pool1/pub/tmp/boo.iso /pool1/pub/tmp/zoo.mkv
As expected this was a no-go; you cannot scrub a RAID-Z pool with only one original disk and a second one that hasn't been fully resilvered.
Without extensive debugging of the filesystem, the only thing left is to see if we can copy any of the healthy files from the pool to the client. ZFS has already told us which files are corrupted.
As a first attempt I want to see if I can mount the directory on the client and then grab files one at a time:
$ rsync -a --progress --stats mnt/testbox/pub/tmp/ tmp3/ sending incremental file list ./ 1.pdf 18,576,345 100% 109.84MB/s 0:00:00 (xfr#1, to-chk=7/9) 2.pdf 30,255,102 100% 67.89MB/s 0:00:00 (xfr#2, to-chk=6/9) 3.pdf 22,016,195 100% 33.92MB/s 0:00:00 (xfr#3, to-chk=5/9)
The file transfer halted at "3.pdf".
I then tried copying the files one at a time, but I could not get any file except the three PDF files - just as ZFS had already told me.
I got the following error on the client:
Cannot read source file. Bad file descriptor.
So these are the files that I managed to salvage from my broken RAID-5 pool:
$ ls -gG total 69196 -rwxr-xr-x 1 18576345 Apr 21 09:08 1.pdf -rwxr-xr-x 1 30255102 Apr 21 09:08 2.pdf -rwxr-xr-x 1 22016195 Apr 21 09:08 3.pdf
This means that I have reached a point where my RAID-Z pool has been destroyed and the files I am able to restore are very limited. This isn't a surprise, as ZFS is extremely good at spreading data and parity evenly across the drives in a RAID-Z. If you lose two drives in a RAID-Z you almost always lose the entire pool.
Had the resilvering process managed to run for a longer time before the second drive "failed", perhaps I would have been able to salvage more files, but there really isn't anything more I can do now.
In my humble opinion a RAID-Z2 (RAID-6) is the minimum for very important files, but RAID-5 is still extremely useful as long as you remember to always keep backups of your important data, no matter what RAID setup you're using. A RAID setup is never a substitute for backup!
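For reference, creating a double-parity pool only differs in the vdev keyword passed to zpool create. A sketch with made-up device names, not the disks used in these tests:

```shell
# Hypothetical device names; with raidz2 (double parity) any two
# of the member disks can fail without losing the pool.
zpool create pool2 raidz2 \
    /dev/disk/by-id/disk-a /dev/disk/by-id/disk-b \
    /dev/disk/by-id/disk-c /dev/disk/by-id/disk-d
```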
Alright, time to do some testing on Btrfs.
Btrfs RAID-5
According to the Btrfs wiki:
The parity RAID feature is mostly implemented, but has some problems in the case of power failure (or other unclean shutdown) which lead to damaged data. It is recommended that parity RAID be used only for testing purposes.
Let's setup a Btrfs RAID-5 system:
# mkfs.btrfs -f -m raid5 -d raid5 /dev/disk/by-id/ata-ST31000340NS_9QJ089LF /dev/disk/by-id/ata-ST31000340NS_9QJ0EQ1V /dev/disk/by-id/ata-ST31000340NS_9QJ0F2YQ btrfs-progs v4.20.2 See http://btrfs.wiki.kernel.org for more information. Label: (null) UUID: 520d615b-4151-4036-962a-ccc202e1f76c Node size: 16384 Sector size: 4096 Filesystem size: 2.73TiB Block group profiles: Data: RAID5 2.00GiB Metadata: RAID5 2.00GiB System: RAID5 16.00MiB SSD detected: no Incompat features: extref, raid56, skinny-metadata Number of devices: 3 Devices: ID SIZE PATH 1 931.51GiB /dev/disk/by-id/ata-ST31000340NS_9QJ089LF 2 931.51GiB /dev/disk/by-id/ata-ST31000340NS_9QJ0EQ1V 3 931.51GiB /dev/disk/by-id/ata-ST31000340NS_9QJ0F2YQ
Then enable lzo compression and mount the pool:
# mount -o noatime,compress=lzo /dev/disk/by-id/ata-ST31000340NS_9QJ089LF /pub/ # btrfs filesystem show -d Label: none uuid: 520d615b-4151-4036-962a-ccc202e1f76c Total devices 3 FS bytes used 128.00KiB devid 1 size 931.51GiB used 2.01GiB path /dev/sdc devid 2 size 931.51GiB used 2.01GiB path /dev/sdb devid 3 size 931.51GiB used 2.01GiB path /dev/sdd # btrfs device stats /pub/ [/dev/sdc].write_io_errs 0 [/dev/sdc].read_io_errs 0 [/dev/sdc].flush_io_errs 0 [/dev/sdc].corruption_errs 0 [/dev/sdc].generation_errs 0 [/dev/sdb].write_io_errs 0 [/dev/sdb].read_io_errs 0 [/dev/sdb].flush_io_errs 0 [/dev/sdb].corruption_errs 0 [/dev/sdb].generation_errs 0 [/dev/sdd].write_io_errs 0 [/dev/sdd].read_io_errs 0 [/dev/sdd].flush_io_errs 0 [/dev/sdd].corruption_errs 0 [/dev/sdd].generation_errs 0
Time to transfer the files from the client using rsync
:
$ rsync -a --progress --stats tmp/ mnt/testbox/pub/tmp/ 1.pdf 18,576,345 100% 165.28MB/s 0:00:00 (xfr#1, to-chk=6/8) 2.pdf 30,255,102 100% 84.86MB/s 0:00:00 (xfr#2, to-chk=5/8) 3.pdf 22,016,195 100% 31.81MB/s 0:00:00 (xfr#3, to-chk=4/8) bar.mkv 35,456,180,485 100% 107.72MB/s 0:05:13 (xfr#4, to-chk=3/8) boo.iso 625,338,368 100% 21.36MB/s 0:00:27 (xfr#5, to-chk=2/8) foo.mkv 1,548,841,922 100% 131.10MB/s 0:00:11 (xfr#6, to-chk=1/8) moo.iso 415,633,408 100% 24.38MB/s 0:00:16 (xfr#7, to-chk=0/8) Number of files: 8 (reg: 7, dir: 1) Number of created files: 8 (reg: 7, dir: 1) Number of deleted files: 0 Number of regular files transferred: 7 Total file size: 38,116,841,825 bytes Total transferred file size: 38,116,841,825 bytes Literal data: 38,116,841,825 bytes Matched data: 0 bytes File list size: 0 File list generation time: 0.001 seconds File list transfer time: 0.000 seconds Total bytes sent: 38,126,148,151 Total bytes received: 202 sent 38,126,148,151 bytes received 202 bytes 102,078,041.11 bytes/sec total size is 38,116,841,825 speedup is 1.00
Compared to the ZFS RAID-Z1 transfer:
sent 38,126,148,150 bytes received 202 bytes 106,945,717.68 bytes/sec
On the Btrfs machine I receive a clear warning about some of the missing functionality of the RAID5/6 capability which is also described on the Btrfs wiki status:
The write hole is the last missing part, preliminary patches have been posted but needed to be reworked. The parity not checksummed note has been removed.
# btrfs filesystem usage /pub WARNING: RAID56 detected, not implemented WARNING: RAID56 detected, not implemented WARNING: RAID56 detected, not implemented Overall: Device size: 2.73TiB Device allocated: 0.00B Device unallocated: 2.73TiB Device missing: 0.00B Used: 0.00B Free (estimated): 0.00B (min: 8.00EiB) Data ratio: 0.00 Metadata ratio: 0.00 Global reserve: 40.25MiB (used: 0.00B) Data,RAID5: Size:36.00GiB, Used:35.51GiB /dev/sdb 18.00GiB /dev/sdc 18.00GiB /dev/sdd 18.00GiB Metadata,RAID5: Size:2.00GiB, Used:40.44MiB /dev/sdb 1.00GiB /dev/sdc 1.00GiB /dev/sdd 1.00GiB System,RAID5: Size:16.00MiB, Used:16.00KiB /dev/sdb 8.00MiB /dev/sdc 8.00MiB /dev/sdd 8.00MiB Unallocated: /dev/sdb 912.50GiB /dev/sdc 912.50GiB /dev/sdd 912.50GiB
Btrfs - Power outage
I have then again added the "zoo.mkv" file to the files on the client and will begin the rsync
transfer and pull the power cord to the Btrfs machine at about 50% of the transfer.
$ rsync -a --progress --stats tmp/ mnt/testbox/pub/tmp/ sending incremental file list zoo.mkv 5,887,590,400 54% 71.49kB/s 19:20:49 ^C
The power cord has been pulled. I have aborted the file transfer on the client and the Btrfs machine has been powered back up again:
# btrfs filesystem show -d Label: none uuid: 520d615b-4151-4036-962a-ccc202e1f76c Total devices 3 FS bytes used 35.55GiB devid 1 size 931.51GiB used 19.01GiB path /dev/sdc devid 2 size 931.51GiB used 19.01GiB path /dev/sdb devid 3 size 931.51GiB used 19.01GiB path /dev/sdd # btrfs device stats /pub [/dev/sdc].write_io_errs 0 [/dev/sdc].read_io_errs 0 [/dev/sdc].flush_io_errs 0 [/dev/sdc].corruption_errs 0 [/dev/sdc].generation_errs 0 [/dev/sdb].write_io_errs 0 [/dev/sdb].read_io_errs 0 [/dev/sdb].flush_io_errs 0 [/dev/sdb].corruption_errs 0 [/dev/sdb].generation_errs 0 [/dev/sdd].write_io_errs 0 [/dev/sdd].read_io_errs 0 [/dev/sdd].flush_io_errs 0 [/dev/sdd].corruption_errs 0 [/dev/sdd].generation_errs 0
Btrfs is also a transactional filesystem and the pool is back up. There are no errors and everything is mountable from the client. As with the ZFS test we have only lost the file that was being transferred.
Btrfs - Drive failure
Time to simulate a drive failure. I will remove the same drive as with ZFS, then afterwards attach a new drive and try to restore the pool.
# btrfs filesystem show -d warning, device 1 is missing checksum verify failed on 83820544 found C780E0CF wanted 23635D79 bad tree block 83820544, bytenr mismatch, want=83820544, have=65536 Label: none uuid: 520d615b-4151-4036-962a-ccc202e1f76c Total devices 3 FS bytes used 35.55GiB devid 2 size 931.51GiB used 19.01GiB path /dev/sdb devid 3 size 931.51GiB used 19.01GiB path /dev/sdd *** Some devices missing
Btrfs is informing us about the missing disk. Let's locate the new one and replace the old with it:
$ ls -gG /dev/disk/by-id ata-ST31000340NS_9QJ0DVN2 -> ../../sdc
I need to mount the pool in a degraded state with one of the working disks:
# mount -o noatime,compress=lzo,degraded /dev/disk/by-id/ata-ST31000340NS_9QJ0F2YQ /pub/
Then because the "broken" device has been removed I have to use the "devid" parameter format in order to replace the device. This is one place in the Btrfs documentation that could benefit from an example.
The "devid" is the missing device ID from the btrfs filesystem show -d
command, not from the "by-id" or "uuid". Also since the new disk already contains a filesystem from the previous test I need to use the -f
option to force the command:
So the command basically is: btrfs replace start old_device new_device mount_point
where old_device is the "devid" number Btrfs has supplied us with:
# btrfs replace start -f 1 /dev/disk/by-id/ata-ST31000340NS_9QJ0DVN2 /pub
We can then check the status of the replacement:
# btrfs replace status -1 /pub 0.4% done, 0 write errs, 0 uncorr. read errs # iostat -dh /dev/disk/by-id/ata-ST31000340NS_9QJ0DVN2 Linux 5.0.9-arch1-1-ARCH (testbox) 04/25/2019 _x86_64_ (2 CPU) tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd Device 148.08 5.1k 11.9M 0.0k 5.1M 11.7G 0.0k sdc
ZFS completed the restoration in about 3 minutes, while Btrfs took a little more than twice as long:
# btrfs replace status -1 /pub Started on 25.Apr 01:39:20, finished on 25.Apr 01:46:59, 0 write errs, 0 uncorr. read errs
The pool is back up again with no missing files or any other problems:
# btrfs filesystem show -d Label: none uuid: 520d615b-4151-4036-962a-ccc202e1f76c Total devices 3 FS bytes used 35.55GiB devid 1 size 931.51GiB used 19.00GiB path /dev/sdc devid 2 size 931.51GiB used 20.03GiB path /dev/sdb devid 3 size 931.51GiB used 20.03GiB path /dev/sdd # btrfs filesystem usage /pub WARNING: RAID56 detected, not implemented WARNING: RAID56 detected, not implemented WARNING: RAID56 detected, not implemented Overall: Device size: 2.73TiB Device allocated: 0.00B Device unallocated: 2.73TiB Device missing: 0.00B Used: 0.00B Free (estimated): 0.00B (min: 8.00EiB) Data ratio: 0.00 Metadata ratio: 0.00 Global reserve: 40.25MiB (used: 0.00B) Data,RAID5: Size:36.00GiB, Used:35.51GiB /dev/sdb 18.00GiB /dev/sdc 18.00GiB /dev/sdd 18.00GiB Metadata,RAID5: Size:3.00GiB, Used:40.44MiB /dev/sdb 2.00GiB /dev/sdc 1.00GiB /dev/sdd 2.00GiB System,RAID5: Size:32.00MiB, Used:16.00KiB /dev/sdb 32.00MiB /dev/sdd 32.00MiB Unallocated: /dev/sdb 911.48GiB /dev/sdc 912.51GiB /dev/sdd 911.48GiB # btrfs device stats /pub [/dev/sdc].write_io_errs 0 [/dev/sdc].read_io_errs 0 [/dev/sdc].flush_io_errs 0 [/dev/sdc].corruption_errs 0 [/dev/sdc].generation_errs 0 [/dev/sdb].write_io_errs 0 [/dev/sdb].read_io_errs 0 [/dev/sdb].flush_io_errs 0 [/dev/sdb].corruption_errs 0 [/dev/sdb].generation_errs 0 [/dev/sdd].write_io_errs 0 [/dev/sdd].read_io_errs 0 [/dev/sdd].flush_io_errs 0 [/dev/sdd].corruption_errs 0 [/dev/sdd].generation_errs 0
In the above I noticed that the data is not spread as evenly across the devices as it was before the simulated failure.
Before:
devid 1 size 931.51GiB used 19.01GiB path /dev/sdc devid 2 size 931.51GiB used 19.01GiB path /dev/sdb devid 3 size 931.51GiB used 19.01GiB path /dev/sdd
After:
devid 1 size 931.51GiB used 19.00GiB path /dev/sdc devid 2 size 931.51GiB used 20.03GiB path /dev/sdb devid 3 size 931.51GiB used 20.03GiB path /dev/sdd
But the usage
command shows that this is only due to metadata:
# btrfs filesystem usage /pub ... Metadata,RAID5: Size:3.00GiB, Used:51.58MiB /dev/sdb 2.00GiB /dev/sdc 1.00GiB /dev/sdd 2.00GiB
Let's perform a scrub now and validate that everything is alright:
# btrfs scrub start /pub/ # btrfs scrub status -d /pub/ scrub status for 520d615b-4151-4036-962a-ccc202e1f76c scrub device /dev/sdc (id 1) history scrub started at Thu Apr 25 04:04:57 2019 and finished after 00:28:11 total bytes scrubbed: 15.23GiB with 0 errors scrub device /dev/sdb (id 2) history scrub started at Thu Apr 25 04:04:57 2019 and finished after 00:27:31 total bytes scrubbed: 15.23GiB with 0 errors scrub device /dev/sdd (id 3) history scrub started at Thu Apr 25 04:04:57 2019 and finished after 00:27:32 total bytes scrubbed: 15.23GiB with 0 errors
So far no problems.
Btrfs - Drive failure during file transfer
Now it's time to remove a drive during an active file transfer:
$ rsync -a --progress --stats tmp/ mnt/testbox/pub/tmp/ sending incremental file list zoo.mkv 10,867,033,488 100% 119.28MB/s 0:01:26 (xfr#3, to-chk=0/9) Number of files: 9 (reg: 8, dir: 1) Number of created files: 1 (reg: 1) Number of deleted files: 0 Number of regular files transferred: 3 Total file size: 48,983,875,313 bytes Total transferred file size: 10,919,304,785 bytes Literal data: 10,919,304,785 bytes Matched data: 0 bytes File list size: 0 File list generation time: 0.001 seconds File list transfer time: 0.000 seconds Total bytes sent: 10,921,970,963 Total bytes received: 76 sent 10,921,970,963 bytes received 76 bytes 96,228,819.73 bytes/sec total size is 48,983,875,313 speedup is 4.48
Btrfs reacted exactly the same way ZFS did. It momentarily halted the file transfer for about a second, then resumed the transfer without the client being able to notice anything other than the momentary drop in the file transfer speed.
On the Btrfs machine the pool has changed the state to a missing device:
# btrfs filesystem show -d /pub Label: none uuid: e4f04b17-c62b-4847-beeb-753bbb64c79a Total devices 3 FS bytes used 36.99GiB devid 1 size 931.51GiB used 21.01GiB path /dev/sdc devid 2 size 931.51GiB used 21.01GiB path /dev/sdd *** Some devices missing
# btrfs filesystem usage /pub WARNING: RAID56 detected, not implemented WARNING: RAID56 detected, not implemented WARNING: RAID56 detected, not implemented Overall: Device size: 2.73TiB Device allocated: 0.00B Device unallocated: 2.73TiB Device missing: 931.51GiB Used: 0.00B Free (estimated): 0.00B (min: 8.00EiB) Data ratio: 0.00 Metadata ratio: 0.00 Global reserve: 51.75MiB (used: 0.00B) Data,RAID5: Size:48.00GiB, Used:45.63GiB /dev/sdb 24.00GiB /dev/sdc 24.00GiB /dev/sdd 24.00GiB Metadata,RAID5: Size:2.00GiB, Used:51.92MiB /dev/sdb 1.00GiB /dev/sdc 1.00GiB /dev/sdd 1.00GiB System,RAID5: Size:16.00MiB, Used:16.00KiB /dev/sdb 8.00MiB /dev/sdc 8.00MiB /dev/sdd 8.00MiB Unallocated: /dev/sdb 906.50GiB /dev/sdc 906.50GiB /dev/sdd 906.50GiB
As with ZFS I powered down the machine in order to safely reattach the drive and then rebooted.
# btrfs filesystem show -d Label: none uuid: e4f04b17-c62b-4847-beeb-753bbb64c79a Total devices 3 FS bytes used 35.38GiB devid 1 size 931.51GiB used 25.01GiB path /dev/sdc devid 2 size 931.51GiB used 25.01GiB path /dev/sdd devid 3 size 931.51GiB used 19.01GiB path /dev/sdb
The show
command reveals that the pool is not in balance. In order to get more information I need to mount the pool and then use the device stats
command.
The device stats
keep a persistent record of several error classes related to doing IO. The current values are printed at mount time and updated during filesystem lifetime or from a scrub:
# btrfs device stats /pub/ [/dev/sdc].write_io_errs 0 [/dev/sdc].read_io_errs 0 [/dev/sdc].flush_io_errs 0 [/dev/sdc].corruption_errs 0 [/dev/sdc].generation_errs 0 [/dev/sdd].write_io_errs 0 [/dev/sdd].read_io_errs 0 [/dev/sdd].flush_io_errs 0 [/dev/sdd].corruption_errs 0 [/dev/sdd].generation_errs 0 [/dev/sdb].write_io_errs 16 [/dev/sdb].read_io_errs 1 [/dev/sdb].flush_io_errs 5 [/dev/sdb].corruption_errs 0 [/dev/sdb].generation_errs 0
The status report clearly shows write errors.
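These counters are easy to check mechanically as well. A sketch of the idea, assuming the "[device].counter value" output layout shown above; the script name is made up:

```shell
#!/bin/sh
# stats-check.sh (hypothetical name): read `btrfs device stats <mount>`
# output on stdin, print every non-zero counter and exit non-zero if
# any was found.
awk 'NF == 2 && $2 != 0 {
    bad = 1
    print "non-zero counter: " $1 " " $2
}
END { exit bad }'
```

Used as `btrfs device stats /pub/ | sh stats-check.sh`, the non-zero exit status can then trigger whatever alerting is already in place.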
In the current situation the correct approach is to do a scrub:
# btrfs scrub start /pub/ scrub started on /pub/, fsid e4f04b17-c62b-4847-beeb-753bbb64c79a (pid=583)
As with ZFS, Btrfs is now running through the data and the checksums and is trying to repair the data.
After the scrubbing is done Btrfs tells us that it has repaired quite a lot of data, all with 0 uncorrectable errors.
# btrfs scrub status -d /pub/ scrub status for e4f04b17-c62b-4847-beeb-753bbb64c79a scrub device /dev/sdc (id 1) history scrub started at Fri Apr 26 03:18:57 2019 and finished after 00:49:34 total bytes scrubbed: 15.23GiB with 25452 errors error details: csum=25452 corrected errors: 25452, uncorrectable errors: 0, unverified errors: 0 scrub device /dev/sdd (id 2) history scrub started at Fri Apr 26 03:18:57 2019 and finished after 00:47:19 total bytes scrubbed: 15.23GiB with 27768 errors error details: csum=27768 corrected errors: 27768, uncorrectable errors: 0, unverified errors: 0 scrub device /dev/sdb (id 3) history scrub started at Fri Apr 26 03:18:57 2019 and finished after 00:49:34 total bytes scrubbed: 15.23GiB with 2316 errors error details: csum=2316 corrected errors: 2316, uncorrectable errors: 0, unverified errors: 0
Btrfs has managed to repair everything without any errors but it still keeps the logs of the errors.
What is noticeable is that ZFS finished the scrubbing and repair in just about 6 minutes while Btrfs took about 50 minutes.
This is because Btrfs has also brought the pool into balance during the scrubbing and Btrfs is famous for being very slow at re-balancing drives:
# btrfs filesystem show -d Label: none uuid: e4f04b17-c62b-4847-beeb-753bbb64c79a Total devices 3 FS bytes used 45.68GiB devid 1 size 931.51GiB used 25.56GiB path /dev/sdc devid 2 size 931.51GiB used 25.56GiB path /dev/sdd devid 3 size 931.51GiB used 25.56GiB path /dev/sdb
Again it is up to the system administrator to decide what he or she wants to do. And this is important, because even though Btrfs has managed to restore the pool we might still be dealing with an unhealthy device.
Can we clear the log? Or do we need to replace the drive anyway? Perhaps S.M.A.R.T has warned us that the drive is currently working, but it is experiencing occasional issues and soon needs to be fully replaced.
# btrfs device stats -c /pub/ [/dev/sdc].write_io_errs 0 [/dev/sdc].read_io_errs 0 [/dev/sdc].flush_io_errs 0 [/dev/sdc].corruption_errs 0 [/dev/sdc].generation_errs 0 [/dev/sdd].write_io_errs 0 [/dev/sdd].read_io_errs 0 [/dev/sdd].flush_io_errs 0 [/dev/sdd].corruption_errs 0 [/dev/sdd].generation_errs 0 [/dev/sdb].write_io_errs 16 [/dev/sdb].read_io_errs 1 [/dev/sdb].flush_io_errs 5 [/dev/sdb].corruption_errs 55536 [/dev/sdb].generation_errs 0
The result of the scrubbing showed zero uncorrectable errors and I know the drive is working fine so I'll just clear the log with the -z
option:
# btrfs device stats -z /pub/ [/dev/sdc].write_io_errs 0 [/dev/sdc].read_io_errs 0 [/dev/sdc].flush_io_errs 0 [/dev/sdc].corruption_errs 0 [/dev/sdc].generation_errs 0 [/dev/sdd].write_io_errs 0 [/dev/sdd].read_io_errs 0 [/dev/sdd].flush_io_errs 0 [/dev/sdd].corruption_errs 0 [/dev/sdd].generation_errs 0 [/dev/sdb].write_io_errs 0 [/dev/sdb].read_io_errs 0 [/dev/sdb].flush_io_errs 0 [/dev/sdb].corruption_errs 0 [/dev/sdb].generation_errs 0
Btrfs - Data corruption during file transfer
Now I want to simulate the same disk corruption in the middle of a file transfer from the client as I did with ZFS.
I have removed the "zoo.mkv" file and while rsync
is running I will use dd
a couple of times on the Btrfs machine on one of the drives:
# dd if=/dev/urandom of=/dev/disk/by-id/ata-ST31000340NS_9QJ0ET8D seek=100000 count=1000 bs=1k
The device stats command did not show any problems:
# btrfs device stats -c /pub [/dev/sdc].write_io_errs 0 [/dev/sdc].read_io_errs 0 [/dev/sdc].flush_io_errs 0 [/dev/sdc].corruption_errs 0 [/dev/sdc].generation_errs 0 [/dev/sdd].write_io_errs 0 [/dev/sdd].read_io_errs 0 [/dev/sdd].flush_io_errs 0 [/dev/sdd].corruption_errs 0 [/dev/sdd].generation_errs 0 [/dev/sdb].write_io_errs 0 [/dev/sdb].read_io_errs 0 [/dev/sdb].flush_io_errs 0 [/dev/sdb].corruption_errs 0 [/dev/sdb].generation_errs 0
However, both dmesg and the log reveal something:
[ 1932.091249] BTRFS error (device sdc): csum mismatch on free space cache [ 1932.091262] BTRFS warning (device sdc): failed to load free space cache for block group 42988470272, rebuilding it now [ 1932.334063] BTRFS error (device sdc): csum mismatch on free space cache [ 1932.334076] BTRFS warning (device sdc): failed to load free space cache for block group 47283437568, rebuilding it now [ 2005.178214] BTRFS error (device sdc): space cache generation (17) does not match inode (19) [ 2005.178222] BTRFS warning (device sdc): failed to load free space cache for block group 38693502976, rebuilding it now
Btrfs did detect the problem and automatically fixed it, but I had expected this kind of error to show up in the device stats
result, perhaps as a corruption error count.
Btrfs - The dd mistake
Time to see what happens if I mistakenly run the dd
command on one of the drives during a file transfer from the client.
As with the ZFS test I have deleted all the files, restarted rsync
:
# dd if=/dev/urandom of=/dev/sdb bs=1k ^C232089+0 records in 232089+0 records out 237659136 bytes (238 MB, 227 MiB) copied, 27.3843 s, 8.7 MB/s
Again the device stats command didn't show any problems:
# btrfs device stats -c /pub/ [/dev/sdc].write_io_errs 0 [/dev/sdc].read_io_errs 0 [/dev/sdc].flush_io_errs 0 [/dev/sdc].corruption_errs 0 [/dev/sdc].generation_errs 0 [/dev/sdb].write_io_errs 0 [/dev/sdb].read_io_errs 0 [/dev/sdb].flush_io_errs 0 [/dev/sdb].corruption_errs 0 [/dev/sdb].generation_errs 0 [/dev/sdd].write_io_errs 0 [/dev/sdd].read_io_errs 0 [/dev/sdd].flush_io_errs 0 [/dev/sdd].corruption_errs 0 [/dev/sdd].generation_errs 0
But dmesg did, after about a minute:
[ 867.808813] BTRFS error (device sdc): bad tree block start, want 53133312 have 17920600362259148199 [ 867.848391] BTRFS info (device sdc): read error corrected: ino 0 off 53133312 (dev /dev/sdb sector 32480) [ 867.886255] BTRFS info (device sdc): read error corrected: ino 0 off 53137408 (dev /dev/sdb sector 32488) [ 867.893746] BTRFS info (device sdc): read error corrected: ino 0 off 53141504 (dev /dev/sdb sector 32496) [ 867.903079] BTRFS info (device sdc): read error corrected: ino 0 off 53145600 (dev /dev/sdb sector 32504) [ 867.928986] BTRFS error (device sdc): bad tree block start, want 53100544 have 125614526405871379 [ 867.946912] BTRFS info (device sdc): read error corrected: ino 0 off 53100544 (dev /dev/sdb sector 32416) [ 867.948135] BTRFS info (device sdc): read error corrected: ino 0 off 53104640 (dev /dev/sdb sector 32424) [ 867.948793] BTRFS info (device sdc): read error corrected: ino 0 off 53108736 (dev /dev/sdb sector 32432) [ 867.952210] BTRFS info (device sdc): read error corrected: ino 0 off 53112832 (dev /dev/sdb sector 32440) [ 868.128686] BTRFS error (device sdc): bad tree block start, want 43614208 have 15420301482281005013 [ 868.130861] BTRFS error (device sdc): bad tree block start, want 43614208 have 15420301482281005013 [ 868.196118] BTRFS error (device sdc): bad tree block start, want 43614208 have 15420301482281005013 [ 868.296277] BTRFS error (device sdc): bad tree block start, want 43614208 have 15420301482281005013 [ 868.333942] BTRFS info (device sdc): read error corrected: ino 0 off 43614208 (dev /dev/sdb sector 23104) [ 868.337820] BTRFS info (device sdc): read error corrected: ino 0 off 43618304 (dev /dev/sdb sector 23112) [ 868.353572] BTRFS error (device sdc): bad tree block start, want 43630592 have 10676903441545527670 [ 868.378400] BTRFS error (device sdc): bad tree block start, want 43597824 have 485580186567037103 [ 868.531339] BTRFS error (device sdc): bad tree block start, want 46039040 have 1852668134064264900 [ 868.569488] BTRFS error (device sdc): bad tree block start, want 46055424 have 418370625237599952
On the client, as with ZFS, there is nothing noticeable going on during file transfer.
Time to run a scrub in order to correct the errors:
# btrfs scrub start /pub/ scrub started on /pub/, fsid 045b8eb9-267a-479b-92af-a996d9a27d12 (pid=468)
# btrfs scrub status -d /pub/ scrub status for 045b8eb9-267a-479b-92af-a996d9a27d12 scrub device /dev/disk/by-id/ata-ST31000340NS_9QJ089LF (id 1) status scrub started at Tue Apr 30 22:58:47 2019, running for 00:01:00 total bytes scrubbed: 582.61MiB with 9 errors error details: csum=9 corrected errors: 9, uncorrectable errors: 0, unverified errors: 0 scrub device /dev/sdb (id 2) status scrub started at Tue Apr 30 22:58:47 2019, running for 00:01:00 total bytes scrubbed: 543.27MiB with 639 errors error details: csum=639 corrected errors: 639, uncorrectable errors: 0, unverified errors: 0 scrub device /dev/sdd (id 3) status scrub started at Tue Apr 30 22:58:47 2019, running for 00:01:00 total bytes scrubbed: 480.40MiB with 2 errors error details: csum=2 corrected errors: 2, uncorrectable errors: 0, unverified errors: 0 WARNING: errors detected during scrubbing, corrected
Btrfs has detected the errors and fixed them:
scrub status for 045b8eb9-267a-479b-92af-a996d9a27d12 scrub device /dev/disk/by-id/ata-ST31000340NS_9QJ089LF (id 1) history scrub started at Tue Apr 30 22:58:47 2019 and finished after 00:25:08 total bytes scrubbed: 15.23GiB with 9 errors error details: csum=9 corrected errors: 9, uncorrectable errors: 0, unverified errors: 0 scrub device /dev/sdb (id 2) history scrub started at Tue Apr 30 22:58:47 2019 and finished after 00:25:06 total bytes scrubbed: 15.23GiB with 639 errors error details: csum=639 corrected errors: 639, uncorrectable errors: 0, unverified errors: 0 scrub device /dev/sdd (id 3) history scrub started at Tue Apr 30 22:58:47 2019 and finished after 00:25:16 total bytes scrubbed: 15.23GiB with 2 errors error details: csum=2 corrected errors: 2, uncorrectable errors: 0, unverified errors: 0
# btrfs device stats -c /pub/ [/dev/sdc].write_io_errs 0 [/dev/sdc].read_io_errs 0 [/dev/sdc].flush_io_errs 0 [/dev/sdc].corruption_errs 0 [/dev/sdc].generation_errs 0 [/dev/sdb].write_io_errs 0 [/dev/sdb].read_io_errs 0 [/dev/sdb].flush_io_errs 0 [/dev/sdb].corruption_errs 650 [/dev/sdb].generation_errs 0 [/dev/sdd].write_io_errs 0 [/dev/sdd].read_io_errs 0 [/dev/sdd].flush_io_errs 0 [/dev/sdd].corruption_errs 0 [/dev/sdd].generation_errs 0
Btrfs handled the problem just as well as ZFS. The only difference was the time it took to do the scrub.
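Note that errors like these are only detected when the affected blocks are actually read or scrubbed, so in practice scrubs should run on a schedule. A minimal sketch, assuming the pool stays mounted at /pub and root's crontab is used (the schedule is just an example):

```shell
# Hypothetical root crontab entry: scrub /pub at 03:00 on the first day
# of each month. -B keeps the scrub in the foreground so cron captures
# its exit status and mails any output to root.
0 3 1 * * /usr/bin/btrfs scrub start -B /pub
```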
Btrfs - The "write hole" issue
Since Btrfs still has warnings about the write hole issue I would like to see if it's possible to recreate the problem in this test.
Parity may be inconsistent after a crash (the "write hole"). The problem arises when a disk failure happens after "an unclean shutdown". These are two distinct failures, and together they break the Btrfs RAID-5 redundancy. If you run a scrub process after "an unclean shutdown" (with no disk failure in between), the data which match their checksums can still be read out, while the mismatched data are lost forever.
These two issues have to exist at the same time:
- An unclean shutdown.
- A disk failure.
So pulling the power cord to the machine during a file transfer and then simulating a disk failure by removing one of the drives should potentially re-create the issue.
I have removed the "zoo.mkv" file from the Btrfs machine. I will pull the power cord during the transfer of that file, then remove a drive and see what happens.
$ rsync -a --progress --stats tmp/ mnt/testbox/pub/tmp/ sending incremental file list zoo.mkv 7,176,814,592 66% 58.80kB/s 17:25:53 ^C
The Btrfs machine has now suffered an unclean shutdown. I have aborted the file transfer on the client and unmounted the Btrfs export. I have then physically changed one of the drives in the Btrfs machine and will now try to do a replacement.
# btrfs filesystem show -d warning, device 2 is missing checksum verify failed on 117506048 found E6CE304B wanted 022D8DFD bad tree block 117506048, bytenr mismatch, want=117506048, have=65536 Couldn't setup extent tree checksum verify failed on 117538816 found 151B2790 wanted F1F89A26 bad tree block 117538816, bytenr mismatch, want=117538816, have=65536 Couldn't setup device tree Label: none uuid: 045b8eb9-267a-479b-92af-a996d9a27d12 Total devices 3 FS bytes used 38.61GiB devid 1 size 931.51GiB used 21.01GiB path /dev/sdc devid 3 size 931.51GiB used 21.01GiB path /dev/sdd *** Some devices missin
I got the same error messages as in the previous test where I simulated a drive failure, except that this time Btrfs is also complaining that it "couldn't setup device tree".
I will now mount the pool in a degraded state and replace the faulty drive to see if we can salvage any data from the pool. The mounting has to be performed with a healthy drive:
# mount -o noatime,compress=lzo,degraded /dev/disk/by-id/ata-ST31000340NS_9QJ0 /pub
This time it is "devid" 2 I need to replace. The new disk is the "9QJ0ET8D" one:
# btrfs replace start -f 2 /dev/disk/by-id/ata-ST31000340NS_9QJ0ET8D /pub/
Let's check the status of the replacement:
# btrfs replace status -1 /pub/ 0.4% done, 0 write errs, 0 uncorr. read errs
Then after a little while:
# btrfs replace status -1 /pub/ Started on 30.Apr 23:58:34, finished on 1.May 00:06:39, 0 write errs, 0 uncorr. read errs
# btrfs filesystem show -d Label: none uuid: 045b8eb9-267a-479b-92af-a996d9a27d12 Total devices 3 FS bytes used 38.61GiB devid 1 size 931.51GiB used 22.04GiB path /dev/sdc devid 2 size 931.51GiB used 21.00GiB path /dev/sdb devid 3 size 931.51GiB used 22.04GiB path /dev/sdd
# btrfs device stats -c /pub [/dev/sdc].write_io_errs 0 [/dev/sdc].read_io_errs 0 [/dev/sdc].flush_io_errs 0 [/dev/sdc].corruption_errs 0 [/dev/sdc].generation_errs 0 [/dev/sdb].write_io_errs 0 [/dev/sdb].read_io_errs 0 [/dev/sdb].flush_io_errs 0 [/dev/sdb].corruption_errs 0 [/dev/sdb].generation_errs 0 [/dev/sdd].write_io_errs 0 [/dev/sdd].read_io_errs 0 [/dev/sdd].flush_io_errs 0 [/dev/sdd].corruption_errs 0 [/dev/sdd].generation_errs 0
# ls -gG /pub/tmp/ total 37223496 -rwxrw-r-- 1 18576345 Apr 21 09:08 1.pdf -rwxrw-r-- 1 30255102 Apr 21 09:08 2.pdf -rwxrw-r-- 1 22016195 Apr 21 09:08 3.pdf -rwxrw-r-- 1 35456180485 Apr 21 07:58 bar.mkv -rwxrw-r-- 1 625338368 Mar 5 2018 boo.iso -rwxrw-r-- 1 1548841922 Apr 15 23:50 foo.mkv -rwxrw-r-- 1 415633408 Mar 5 2018 moo.iso
Everything has been restored nicely and all three drives are performing well. I didn't lose any files or suffer any parity issues that made the replacement a problem.
I have repeated the above test with the same result more than once.
Btrfs - A second drive failure during a replacement
Now I want to see what's going to happen with Btrfs when I lose a second drive during a replacement procedure.
I have removed one of the drives and am mounting the Btrfs pool in a degraded state in order to begin a replacement:
# btrfs filesystem show -d Label: none uuid: 045b8eb9-267a-479b-92af-a996d9a27d12 Total devices 3 FS bytes used 38.61GiB devid 1 size 931.51GiB used 22.03GiB path /dev/sdc devid 3 size 931.51GiB used 22.03GiB path /dev/sdd *** Some devices missing
# mount -o noatime,compress=lzo,degraded /dev/disk/by-id/ata-ST31000340NS_9QJ089LF /pub
As I did with ZFS, while the replacement procedure is running I will disconnect one of the working drives.
# btrfs replace start -f 2 /dev/disk/by-id/ata-ST31000340NS_9QJ0ET8D /pub/
Let's check the status:
# btrfs replace status -1 /pub/ 0.1% done, 0 write errs, 0 uncorr. read errs
I now disconnect a second drive by removing the power cord for the drive:
# btrfs replace status -1 /pub/ Started on 3.May 21:03:21, canceled on 3.May 21:04:12 at 0.0%, 0 write errs, 0 uncorr. read errs
Btrfs cancelled the replacement when the second drive went offline.
# ls -gG /pub/tmp/ ls: cannot access '/pub/tmp/boo.iso': Input/output error ls: cannot access '/pub/tmp/foo.mkv': Input/output error ls: cannot access '/pub/tmp/moo.iso': Input/output error total 34694376 -rwxrw-r-- 1 18576345 Apr 21 09:08 1.pdf -rwxrw-r-- 1 30255102 Apr 21 09:08 2.pdf -rwxrw-r-- 1 22016195 Apr 21 09:08 3.pdf -rwxrw-r-- 1 35456180485 Apr 21 07:58 bar.mkv -????????? ? ? ? boo.iso -????????? ? ? ? foo.mkv -????????? ? ? ? moo.iso
We clearly have a problem.
I have attached a new drive and the Btrfs machine now only has one healthy drive in the pool and two new drives of which one has only been partly replaced.
# umount /pub # btrfs filesystem show -d warning, device 3 is missing Label: none uuid: 045b8eb9-267a-479b-92af-a996d9a27d12 Total devices 3 FS bytes used 38.61GiB devid 1 size 931.51GiB used 22.03GiB path /dev/sdc *** Some devices missing
# mount -o noatime,compress=lzo,degraded /dev/disk/by-id/ata-ST31000340NS_9QJ089LF /pub/ # btrfs filesystem show -d warning, device 3 is missing checksum verify failed on 119832576 found B67B4ABD wanted A302A7B3 checksum verify failed on 119832576 found B67B4ABD wanted A302A7B3 bad tree block 119832576, bytenr mismatch, want=119832576, have=5117397648563945276 Label: none uuid: 045b8eb9-267a-479b-92af-a996d9a27d12 Total devices 3 FS bytes used 38.61GiB devid 1 size 931.51GiB used 22.03GiB path /dev/sdc devid 2 size 931.51GiB used 21.01GiB path /dev/sdd *** Some devices missing
The drive that went through the partial replacement is at least recognized as belonging to the pool.
Now, in this situation trying to run any kind of repair process would not only be futile, it would also be very wrong: the filesystem isn't damaged and it doesn't require any kind of repair.
Again I will try to replace the third disk and see if I maybe have enough data and metadata lying around to actually restore the pool without losing any data (as with ZFS, this is a very long shot):
Let's locate the new disk:
# ls -l /dev/disk/by-id/ ata-ST31000340NS_9QJ089LF -> ../../sdc ata-ST31000340NS_9QJ0DVN2 -> ../../sdd ata-ST31000340NS_9QJ0ES1V -> ../../sdb
"devid" 3 needs to be replaced with the "9QJ0ES1V" one:
# btrfs replace start -f 3 /dev/disk/by-id/ata-ST31000340NS_9QJ0ES1V /pub/
No errors. Let's check the status:
# btrfs replace status -1 /pub/ Started on 3.May 21:03:21, suspended on 1.May 00:06:39 at 0.2%, 0 write errs, 0 uncorr. read errs
Suspended!
Let's see what dmesg says:
[ 509.084144] BTRFS info (device sdc): use lzo compression, level 0 [ 509.084147] BTRFS info (device sdc): allowing degraded mounts [ 509.084148] BTRFS info (device sdc): disk space caching is enabled [ 509.084150] BTRFS info (device sdc): has skinny extents [ 509.107081] BTRFS warning (device sdc): devid 3 uuid 9078bc78-a5ba-4178-96ca-53fb2e29b62c is missing [ 509.167206] BTRFS info (device sdc): cannot continue dev_replace, tgtdev is missing [ 509.167208] BTRFS info (device sdc): you may cancel the operation after 'mount -o degraded'
So a replacement is not possible.
On ZFS we get much better information from zpool status -v, about both the replacement status and the specific files that cannot be restored.
Let's run a scrub and see if by any chance we can salvage some files and restore as much of the pool as possible:
# btrfs scrub start /pub/ scrub started on /pub/, fsid 045b8eb9-267a-479b-92af-a996d9a27d12 (pid=497)
Let's check:
# btrfs scrub status -d /pub/ scrub status for 045b8eb9-267a-479b-92af-a996d9a27d12 scrub device /dev/sdc (id 1) history scrub started at Fri May 3 21:18:18 2019 and was aborted after 00:00:00 total bytes scrubbed: 0.00B with 0 errors scrub device /dev/sdd (id 2) history scrub started at Fri May 3 21:18:18 2019 and was aborted after 00:00:00 total bytes scrubbed: 0.00B with 0 errors scrub device /dev/sdd (id 3) history scrub started at Fri May 3 21:18:18 2019 and was aborted after 00:00:00 total bytes scrubbed: 0.00B with 0 errors
Aborted.
This was a no-go; we cannot do a scrub on a RAID-5 pool with only one original disk and a second one that hasn't been replaced correctly.
# ls -gG /pub/tmp/ total 37223496 -rwxrw-r-- 1 18576345 Apr 21 09:08 1.pdf -rwxrw-r-- 1 30255102 Apr 21 09:08 2.pdf -rwxrw-r-- 1 22016195 Apr 21 09:08 3.pdf -rwxrw-r-- 1 35456180485 Apr 21 07:58 bar.mkv -rwxrw-r-- 1 625338368 Mar 5 2018 boo.iso -rwxrw-r-- 1 1548841922 Apr 15 23:50 foo.mkv -rwxrw-r-- 1 415633408 Mar 5 2018 moo.iso
The only thing left is to see how many files I can salvage:
$ rsync -a --progress --stats mnt/testbox/pub/tmp/ tmp3/ sending incremental file list ./ 1.pdf 18,576,345 100% 111.22MB/s 0:00:00 (xfr#1, to-chk=6/8) 2.pdf 30,255,102 100% 68.05MB/s 0:00:00 (xfr#2, to-chk=5/8) 3.pdf 22,016,195 100% 33.97MB/s 0:00:00 (xfr#3, to-chk=4/8) bar.mkv 41,451,520 0% 39.18MB/s 0:14:42
Then it halted.
I then tried copying the files over one at a time, and to my big surprise I actually managed to get all the files except the "bar.mkv" file!
$ ls -gG total 2598328 -rwxr-xr-x 1 18576345 Apr 21 09:08 1.pdf -rwxr-xr-x 1 30255102 Apr 21 09:08 2.pdf -rwxr-xr-x 1 22016195 Apr 21 09:08 3.pdf -rwxr-xr-x 1 625338368 Mar 5 2018 boo.iso -rwxr-xr-x 1 1548841922 Apr 15 23:50 foo.mkv -rwxr-xr-x 1 415633408 Mar 5 2018 moo.iso
During the attempt to transfer the "bar.mkv" file the following errors showed up on the Btrfs machine:
# dmesg [ 4177.376785] BTRFS error (device sdc): bad tree block start, want 38944768 have 7071809559058736496 [ 4177.378494] BTRFS error (device sdc): bad tree block start, want 38961152 have 16350034114213725736 [ 4177.378718] BTRFS error (device sdc): bad tree block start, want 38977536 have 8392528330119265768 [ 4177.379183] BTRFS error (device sdc): bad tree block start, want 38928384 have 6084014255993522895 [ 4181.808743] BTRFS critical (device sdc): corrupt node: root=7 block=39124992 slot=0, unaligned pointer, have 12335186693368 should be aligned to 4096 [ 4181.808757] BTRFS info (device sdc): no csum found for inode 261 start 52690944 [ 4181.808856] BTRFS critical (device sdc): corrupt node: root=7 block=39124992 slot=0, unaligned pointer, have 12335186693368 should be aligned to 4096 [ 4181.808866] BTRFS info (device sdc): no csum found for inode 261 start 52695040 [ 4181.808955] BTRFS critical (device sdc): corrupt node: root=7 block=39124992 slot=0, unaligned pointer, have 12335186693368 should be aligned to 4096 [ 4181.808965] BTRFS info (device sdc): no csum found for inode 261 start 52699136 [ 4181.809051] BTRFS critical (device sdc): corrupt node: root=7 block=39124992 slot=0, unaligned pointer, have 12335186693368 should be aligned to 4096 ...
Btrfs has the "btrfs restore" command, which is used to try to salvage files from a damaged filesystem and restore them somewhere else. The man page explains:
btrfs restore could be used to retrieve file data, as far as the metadata are readable. The checks done by restore are less strict and the process is usually able to get far enough to retrieve data from the whole filesystem. This comes at a cost that some data might be incomplete or from older versions if they’re available.
There are several options to attempt restoration of various file metadata type. You can try a dry run first to see how well the process goes and use further options to extend the set of restored metadata.
I have 129G available on the boot disk so I can try to restore files to that drive.
I'm going to use "sdc" first, which is the healthy and original working drive, followed by "sdd", which is the disk that was partly replaced. The last disk, "sdb", is useless.
# mkdir /restored-files # umount /pub # btrfs restore -D /dev/sdc /restored-files/ warning, device 3 is missing checksum verify failed on 115867648 found E486C552 wanted 006578E4 bad tree block 115867648, bytenr mismatch, want=115867648, have=65536 Could not open root, trying backup super warning, device 3 is missing checksum verify failed on 38895616 found 69CF6F65 wanted 8D2CD2D3 checksum verify failed on 38895616 found 69CF6F65 wanted 8D2CD2D3 bad tree block 38895616, bytenr mismatch, want=38895616, have=65536 checksum verify failed on 115867648 found E486C552 wanted 006578E4 bad tree block 115867648, bytenr mismatch, want=115867648, have=65536 Could not open root, trying backup super warning, device 3 is missing checksum verify failed on 38895616 found 69CF6F65 wanted 8D2CD2D3 checksum verify failed on 38895616 found 69CF6F65 wanted 8D2CD2D3 bad tree block 38895616, bytenr mismatch, want=38895616, have=65536 checksum verify failed on 115867648 found E486C552 wanted 006578E4 bad tree block 115867648, bytenr mismatch, want=115867648, have=65536 Could not open root, trying backup super # btrfs restore -D /dev/sdd /restored-files/ warning, device 3 is missing checksum verify failed on 22020096 found 7DCD7CC1 wanted 28699DE8 checksum verify failed on 22020096 found 7DCD7CC1 wanted 28699DE8 bad tree block 22020096, bytenr mismatch, want=22020096, have=899525736547221204 ERROR: cannot read chunk root Could not open root, trying backup super warning, device 3 is missing warning, device 1 is missing bad tree block 22020096, bytenr mismatch, want=22020096, have=0 ERROR: cannot read chunk root Could not open root, trying backup super warning, device 3 is missing warning, device 1 is missing bad tree block 22020096, bytenr mismatch, want=22020096, have=0 ERROR: cannot read chunk root Could not open root, trying backup super
Removing the useless disk in order to run on two disks only doesn't work; Btrfs refuses to shrink a RAID-5 below its minimum number of devices:
# btrfs device remove missing 3 /pub ERROR: error removing device 'missing': unable to go below two devices on raid5 ERROR: error removing devid 3: unable to go below two devices on raid5
Adding a new disk and then trying to have Btrfs re-balance also fails, as expected:
# btrfs balance start -v /pub/ Dumping filters: flags 0x7, state 0x0, force is off DATA (flags 0x0): balancing METADATA (flags 0x0): balancing SYSTEM (flags 0x0): balancing WARNING: Full balance without filters requested. This operation is very intense and takes potentially very long. It is recommended to use the balance filters to narrow down the scope of balance. Use 'btrfs balance start --full-balance' option to skip this warning. The operation will start in 10 seconds. Use Ctrl-C to stop it. 10 9 8 7 6 5 4 3 2 1 Starting balance without any filters.
The balance ends prematurely:
# dmesg [ 1179.816473] BTRFS info (device sdc): balance: resume -dusage=90 -musage=90 -susage=90 [ 1179.816732] BTRFS info (device sdc): relocating block group 48524951552 flags data|raid5 [ 1180.074942] BTRFS info (device sdc): relocating block group 47451209728 flags metadata|raid5 [ 1180.391206] BTRFS info (device sdc): found 12 extents [ 1180.632952] BTRFS info (device sdc): relocating block group 47384100864 flags system|raid5 [ 1180.894086] BTRFS info (device sdc): found 1 extents [ 1181.132700] BTRFS info (device sdc): relocating block group 42988470272 flags data|raid5 [ 1211.850068] BTRFS info (device sdc): found 13 extents [ 1213.063935] BTRFS error (device sdc): bad tree block start, want 65650688 have 13914138350834705721 [ 1213.072832] BTRFS: error (device sdc) in btrfs_run_delayed_refs:3011: errno=-5 IO failure [ 1213.072834] BTRFS info (device sdc): forced readonly [ 1213.072859] BTRFS info (device sdc): balance: ended with status: -30
I was actually very surprised at the number of files that I managed to salvage with Btrfs.
This means that either all the files, except the missing one, were located physically on that single healthy drive, or parts of the files plus the needed parity data were all located on that single healthy drive plus the second drive that was partially replaced.
Does this mean that Btrfs perhaps isn't very good at balancing data and parity data evenly across multiple drives in a RAID-5 setup so that I ended up having most of the data needed on only one drive?
Or does this mean that with Btrfs sometimes you just "get lucky" and stand a greater chance at getting your files back even when two drives fail in a RAID-5 setup?
I decided to re-test this to see if I would get the same results again, this time by pulling the "sdc" disk, which was healthy before. Of course, I might just get the same results because Btrfs is now using another disk in the same way.
I have created a completely fresh RAID-5 pool and mounted it:
# mkfs.btrfs -f -m raid5 -d raid5 /dev/disk/by-id/ata-ST31000340NS_9QJ089LF /dev/disk/by-id/ata-ST31000340NS_9QJ0DVN2 /dev/disk/by-id/ata-ST31000340NS_9QJ0EZZC btrfs-progs v4.20.2 See http://btrfs.wiki.kernel.org for more information. Label: (null) UUID: 226b366f-64f0-447e-87eb-31c91e5992b6 Node size: 16384 Sector size: 4096 Filesystem size: 2.73TiB Block group profiles: Data: RAID5 2.00GiB Metadata: RAID5 2.00GiB System: RAID5 16.00MiB SSD detected: no Incompat features: extref, raid56, skinny-metadata Number of devices: 3 Devices: ID SIZE PATH 1 931.51GiB /dev/disk/by-id/ata-ST31000340NS_9QJ089LF 2 931.51GiB /dev/disk/by-id/ata-ST31000340NS_9QJ0DVN2 3 931.51GiB /dev/disk/by-id/ata-ST31000340NS_9QJ0EZZC # mount -o noatime,compress=lzo /dev/disk/by-id/ata-ST31000340NS_9QJ089LF /pub
Then I have transferred all the files from the client again:
# ls -gG /pub/tmp/ total 37223496 -rwxrw-r-- 1 18576345 Apr 21 09:08 1.pdf -rwxrw-r-- 1 30255102 Apr 21 09:08 2.pdf -rwxrw-r-- 1 22016195 Apr 21 09:08 3.pdf -rwxrw-r-- 1 35456180485 Apr 21 07:58 bar.mkv -rwxrw-r-- 1 625338368 Mar 5 2018 boo.iso -rwxrw-r-- 1 1548841922 Apr 15 23:50 foo.mkv -rwxrw-r-- 1 415633408 Mar 5 2018 moo.iso # btrfs filesystem df /pub/ Data, RAID5: total=36.00GiB, used=35.51GiB System, RAID5: total=16.00MiB, used=16.00KiB Metadata, RAID5: total=2.00GiB, used=40.39MiB GlobalReserve, single: total=40.20MiB, used=0.00B
I have now removed the device that was "sdc".
# btrfs filesystem show -d warning, device 1 is missing checksum verify failed on 85508096 found A0A8052D wanted 444BB89B bad tree block 85508096, bytenr mismatch, want=85508096, have=65536 Couldn't read tree root Label: none uuid: 663b05c8-c9b3-4c88-a450-36b5e25a39c2 Total devices 3 FS bytes used 35.55GiB devid 2 size 931.51GiB used 19.01GiB path /dev/sdb devid 3 size 931.51GiB used 19.01GiB path /dev/sdd *** Some devices missing
I am then mounting the Btrfs pool in a degraded state and beginning a replacement, then I will remove the next drive from the pool during the replacement:
# mount -o noatime,compress=lzo,degraded /dev/disk/by-id/ata-ST31000340NS_9QJ0DVN2 /pub/ # btrfs replace start -f 1 /dev/disk/by-id/ata-ST31000340NS_9QJ0ES1V /pub/ # btrfs replace status -1 /pub 0.2% done, 0 write errs, 0 uncorr. read errs
This time I am experiencing a crash:
# dmesg [ 581.184298] kernel BUG at fs/btrfs/raid56.c:1910! [ 581.184304] invalid opcode: 0000 [#3] PREEMPT SMP PTI [ 581.184309] CPU: 1 PID: 366 Comm: kworker/u8:0 Tainted: G D I 5.0.10-arch1-1-ARCH #1 [ 581.184315] Hardware name: Hewlett-Packard HP Compaq dc7900 Small Form Factor/3031h, BIOS 786G1 v01.08 08/25/2008 [ 581.184351] Workqueue: btrfs-endio-raid56 btrfs_endio_raid56_helper [btrfs] [ 581.184385] RIP: 0010:__raid_recover_end_io+0x37e/0x450 [btrfs] [ 581.184390] Code: 00 ff ff ff ff 85 c0 74 47 83 f8 02 0f 85 e3 00 00 00 48 83 c4 10 48 89 df 31 f6 5b 5d 41 5c 41 5d 41 5e 41 5f e9 f2 ee ff ff <0f> 0b 4c 8d a3 98 00 00 00 4c 89 e7 e8 51 73 f1 d7 f0 80 8b b0 00 [ 581.184399] RSP: 0018:ffff9eb141347e18 EFLAGS: 00010213 [ 581.184403] RAX: ffff92d37c72a800 RBX: ffff92d37f03d800 RCX: 0000000000000000 [ 581.184408] RDX: 0000000000000002 RSI: 0000000000000010 RDI: 0000000000000003 [ 581.184412] RBP: 0000000000000000 R08: 0000000000000008 R09: ffff92d391a0a000 [ 581.184417] R10: 0000000000000008 R11: 000000000000000c R12: 0000000000000003 [ 581.184426] R13: 0000000000000000 R14: 0000000000000001 R15: ffff92d384525e80 [ 581.184435] FS: 0000000000000000(0000) GS:ffff92d393a80000(0000) knlGS:0000000000000000 [ 581.184440] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 581.184445] CR2: 000055fdae19b1c8 CR3: 000000020f3ca000 CR4: 00000000000406e0 [ 581.184449] Call Trace: [ 581.184484] normal_work_helper+0xbd/0x350 [btrfs] [ 581.184491] process_one_work+0x1eb/0x410 [ 581.184496] worker_thread+0x2d/0x3d0 [ 581.184501] ? process_one_work+0x410/0x410 [ 581.184506] kthread+0x112/0x130 [ 581.184511] ? 
kthread_park+0x80/0x80 [ 581.184516] ret_from_fork+0x35/0x40 [ 581.184521] Modules linked in: snd_hda_codec_analog i915 snd_hda_codec_generic ledtrig_audio kvmgt vfio_mdev mdev btrfs vfio_iommu_type1 vfio i2c_algo_bit snd_hda_intel drm_kms_helper snd_hda_codec coretemp drm snd_hda_core libcrc32c syscopyarea kvm snd_hwdep sysfillrect snd_pcm sysimgblt xor fb_sys_fops irqbypass snd_timer input_leds snd raid6_pq joydev tpm_infineon psmouse tpm_tis soundcore hp_wmi tpm_tis_core intel_agp sparse_keymap mei_wdt iTCO_wdt e1000e mei_me tpm intel_gtt iTCO_vendor_support rfkill pcspkr mei gpio_ich agpgart wmi_bmof evdev rng_core mac_hid lpc_ich wmi pcc_cpufreq acpi_cpufreq ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 fscrypto hid_generic usbhid hid sd_mod serio_raw uhci_hcd atkbd libps2 ahci libahci ata_generic pata_acpi libata ehci_pci ehci_hcd scsi_mod floppy i8042 serio
The replacement has also stalled, so I rebooted the Btrfs machine.
Now, I cannot mount the filesystem:
# mount -o noatime,compress=lzo,degraded /dev/disk/by-id/ata-ST31000340NS_9QJ0DVN2 /pub/ mount: /pub: wrong fs type, bad option, bad superblock on /dev/sdb, missing codepage or helper program, or other error.
I have tried btrfs device scan, mounting in "recovery" mode, btrfs restore, and btrfs rescue zero-log, but nothing worked.
Have I just now hit one of the RAID-5 bugs? The wiki does say:
The parity RAID code has multiple serious data-loss bugs in it. It should not be used for anything other than testing purposes.
The answer is actually no. Well, the crash is a bug, but the mount issue is not a bug.
The simple fact is that you cannot expect to survive a two-drive failure in a RAID-5 setup no matter what filesystem you are using.
Sometimes, as in the first attempt, you might get away with restoring some files. At other times you will simply lose the entire pool. Expect the latter with both ZFS and Btrfs!
Enough Btrfs for now. Time to test mdadm+dm-integrity.
mdadm+dm-integrity RAID-5
UPDATE 2019-08-27: It has come to my attention (thank you Philip!) that I made an unfortunate mistake in my tests of mdadm+dm-integrity. When I tested for data integrity errors I wrote to /dev/mapper/sdb, which also updates the dm-integrity checksum. Later, when I do the sync-action check, the errors are not dm-integrity checksum errors, but rather RAID parity errors. The correct test would have been to write random data directly to /dev/sdb. At the end of the mdadm+dm-integrity section I have copy/pasted the result of a test Philip sent me by email, which contains an example of how the test should have been run. I have also updated the article with a note each time I made the mistake.
I stumbled upon dm-integrity as I was doing some of the tests with Btrfs and I haven't used it before. I therefore thought that it would be interesting to see how mdadm+dm-integrity handles the same problems that I have just tested ZFS and Btrfs with.
mdadm is used for administering pure software RAID on plain block devices, but it does not provide any kind of data integrity verification. If a read error is encountered, mdadm recalculates the block in error and writes it back. In a mirror there is nothing to recalculate from, so mdadm takes the data from the first available drive, assumes it is correct, and writes it to the other drive. In a degraded RAID pool mdadm terminates immediately without doing anything, as it cannot recalculate the faulty data.
dm-integrity by itself has nothing to do with RAID. When dm-integrity encounters a data integrity error it simply returns EILSEQ (instead of EIO); it is then up to the RAID driver, mdadm in this case, to handle the integrity error properly. dm-integrity can also be used without encryption when encryption is not desired for technical or other reasons.
From the documentation:
The dm-integrity target can also be used as a standalone target, in this mode it calculates and verifies the integrity tag internally. In this mode, the dm-integrity target can be used to detect silent data corruption on the disk or in the I/O path.
To guarantee write atomicity, the dm-integrity target uses journal, it writes sector data and integrity tags into a journal, commits the journal and then copies the data and integrity tags to their respective location.
If you combine dm-integrity with an mdadm RAID (RAID-1/mirror, RAID-5, or any other redundant setup) you get both disk redundancy and error detection and correction: dm-integrity raises checksum errors when it encounters invalid data, which mdadm notices and then repairs with correct data.
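The repair pass itself is triggered through the md sysfs interface rather than a filesystem command. A minimal sketch, assuming the array comes up as md127 as it does below; the helper names and the overridable md variable are my own:

```shell
# A check ("scrub") of the whole array is requested through sysfs.
# Path assumes the array is md127; md can be overridden for testing.
md=${md:-/sys/block/md127/md}

# Ask md to read every stripe. A sector failing its dm-integrity tag
# comes back as a read error, which md rebuilds from the remaining
# devices and writes back.
start_check() { echo check > "$md/sync_action"; }

# True (exit 0) while a check is still in progress.
check_running() { grep -q check "$md/sync_action"; }

# Mismatches counted by the last check. Note: as the update above
# explains, this counts RAID parity errors, not dm-integrity errors.
mismatches() { cat "$md/mismatch_cnt"; }
```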
With mdadm, you specify the raid device to create, the raid mode level (raid0, raid1, raid10, raid5, raid6 etc) and the devices. mdadm is very well documented and it contains tons of options with examples as well, but it is also very easy to make mistakes with mdadm.
If you just want simple data integrity verification without any of the extra functionality that ZFS or Btrfs offers, then dm-integrity alone can do the job - you just need to run regular scrubs of the filesystem and then make sure you have adequate backup to handle any potential integrity problems.
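In that standalone setup a "scrub" is nothing more than reading the device end to end: dm-integrity fails the read of any sector whose stored tag no longer matches. A minimal sketch, assuming the device was opened as /dev/mapper/sdb as below; the function name is my own:

```shell
# Read an opened dm-integrity device front to back. Any sector with a
# bad tag makes the read fail (EILSEQ), so dd's exit status tells us
# whether the device is clean.
integrity_scan() {
    if dd if="$1" of=/dev/null bs=1M status=none; then
        echo "integrity scan clean"
    else
        echo "integrity errors on $1 - restore affected files from backup"
    fi
}

# Example (commented out, needs the device to exist):
# integrity_scan /dev/mapper/sdb
```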
Since I'm not using encryption in these tests I will use the integritysetup command instead of the cryptsetup command to format the disks. It is worth noting, however, that dm-integrity is best integrated with dm-crypt+LUKS for disk encryption.
By default, integritysetup uses "crc32", which is relatively fast and requires just 4 bytes per block. This gives a probability of about 1 in 2^32 that a random corruption goes undetected. This is then on top of any silent corruption on the hard drive itself.
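To put the tag sizes in perspective, the space overhead per sector is tiny either way; a quick sketch of the arithmetic for the default crc32 (4 bytes) and the sha256 tag (32 bytes) I use below:

```shell
# Space overhead of the integrity tag per 4096-byte data sector,
# in percent, for crc32 (4 bytes) and sha256 (32 bytes).
awk 'BEGIN {
    sector = 4096
    printf "crc32:  %.2f%%\n", 100 * 4  / sector
    printf "sha256: %.2f%%\n", 100 * 32 / sector
}'
```

Even sha256 costs well under one percent of the raw capacity; the real cost of dm-integrity is the extra journalled writes, not the space.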
With dm-integrity the devices need to be wiped during formatting in order to avoid invalid checksums. As this takes an extremely long time with the 1 TB disks, I have swapped in three old 160 GB disks, and I'm just going to use the shorthand sdX for device names (never do that; I am only doing it for the sake of the test. Always use device names with serial numbers for easy identification).
# integritysetup format --integrity sha256 /dev/sdb WARNING! ======== This will overwrite data on /dev/sdb irrevocably. Are you sure? (Type uppercase yes): YES WARNING: Device /dev/sdb already contains a 'dos' partition signature. Formatted with tag size 4, internal integrity sha256. Wiping device to initialize integrity checksum. You can interrupt this by pressing CTRL+c (rest of not wiped device will contain invalid checksum). Progress: 2.0%, ETA 49:12, 2991 MiB written, speed 50.3 MiB/s
Then opening the devices:
# integritysetup open --integrity sha256 /dev/sdb sdb # integritysetup open --integrity sha256 /dev/sdc sdc # integritysetup open --integrity sha256 /dev/sdd sdd
And creating the mdadm RAID-5 system:
# mdadm --create --verbose --assume-clean --level=5 --raid-devices=3 /dev/md/raid5 /dev/mapper/sdb /dev/mapper/sdc /dev/mapper/sdd mdadm: layout defaults to left-symmetric mdadm: layout defaults to left-symmetric mdadm: chunk size defaults to 512K mdadm: size set to 154882048K mdadm: automatically enabling write-intent bitmap on large pool mdadm: Defaulting to version 1.2 metadata mdadm: pool /dev/md/raid5 started.
Then create the ext4 filesystem on top of that:
# mkfs.ext4 /dev/md/raid5
In the above I have just used the defaults; I didn't calculate the correct stripe width and stride for a mdadm RAID-5 setup.
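For completeness, the values I skipped can be derived from the mdadm geometry: stride is the chunk size divided by the ext4 block size, and stripe-width is stride times the number of data-bearing disks. A sketch of the arithmetic, with the numbers from the mdadm output below (512 KiB chunks, 4 KiB ext4 blocks, 3 disks of which 2 carry data):

```shell
chunk_kib=512              # mdadm chunk size in KiB (from mdadm --detail)
block_kib=4                # ext4 block size in KiB
data_disks=2               # RAID-5 loses one of the three disks to parity

stride=$((chunk_kib / block_kib))       # filesystem blocks per chunk
stripe_width=$((stride * data_disks))   # blocks per full data stripe
echo "stride=$stride stripe-width=$stripe_width"
```

These would then be passed as mkfs.ext4 -E stride=128,stripe-width=256 /dev/md/raid5.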
Time to get some status information:
$ cat /proc/mdstat Personalities : [raid6] [raid5] [raid4] md127 : active raid5 dm-2[2] dm-1[1] dm-0[0] 309764096 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU] bitmap: 0/2 pages [0KB], 65536KB chunk unused devices: <none>
# mdadm --misc -D /dev/md/raid5
/dev/md/raid5:
           Version : 1.2
     Creation Time : Sat Apr 27 04:22:29 2019
        Raid Level : raid5
        Array Size : 309764096 (295.41 GiB 317.20 GB)
     Used Dev Size : 154882048 (147.71 GiB 158.60 GB)
      Raid Devices : 3
     Total Devices : 3
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Sat Apr 27 04:24:23 2019
             State : clean
    Active Devices : 3
   Working Devices : 3
    Failed Devices : 0
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : testbox:raid5  (local to host testbox)
              UUID : 43262b9e:9d039a22:539d7bbf:be558b32
            Events : 2

    Number   Major   Minor   RaidDevice State
       0     254        0        0      active sync   /dev/dm-0
       1     254        1        1      active sync   /dev/dm-1
       2     254        2        2      active sync   /dev/dm-2
Then just to validate dm-integrity is setup correctly (here just on the sdb disk):
# ls /sys/block/md127/integrity/
device_is_integrity_capable

# ls /sys/block/sdb/integrity/
device_is_integrity_capable

# ls /sys/block/dm-0/integrity/
device_is_integrity_capable
# dmsetup info /dev/mapper/sdb
Name:              sdb
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      254, 0
Number of targets: 1
UUID: CRYPT-INTEGRITY-sdb
...
So I am up and running with a mdadm RAID-5 setup with dm-integrity.
I have mounted the mdadm device on /pub and it's time to transfer some files from the client using rsync:
$ rsync -a --progress --stats tmp/ mnt/testbox/pub/tmp/
1.pdf
     18,576,345 100%  245.62MB/s    0:00:00 (xfr#1, to-chk=6/8)
2.pdf
     30,255,102 100%   72.86MB/s    0:00:00 (xfr#2, to-chk=5/8)
3.pdf
     22,016,195 100%   27.41MB/s    0:00:00 (xfr#3, to-chk=4/8)
bar.mkv
 35,456,180,485 100%   30.32MB/s    0:18:35 (xfr#4, to-chk=3/8)
boo.iso
         32,768   0%  744.19kB/s    0:14:00
    625,338,368 100%    5.70MB/s    0:01:44 (xfr#5, to-chk=2/8)
foo.mkv
  1,548,841,922 100%   56.08MB/s    0:00:26 (xfr#6, to-chk=1/8)
moo.iso
    415,633,408 100%    7.12MB/s    0:00:55 (xfr#7, to-chk=0/8)

Number of files: 8 (reg: 7, dir: 1)
Number of created files: 8 (reg: 7, dir: 1)
Number of deleted files: 0
Number of regular files transferred: 7
Total file size: 38,116,841,825 bytes
Total transferred file size: 38,116,841,825 bytes
Literal data: 38,116,841,825 bytes
Matched data: 0 bytes
File list size: 0
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 38,126,148,151
Total bytes received: 202

sent 38,126,148,151 bytes  received 202 bytes  28,938,253.02 bytes/sec
total size is 38,116,841,825  speedup is 1.00
This was painfully slow, but these are some very old 2.5" 5400 RPM laptop disks I'm using.
Compared to the ZFS RAID-Z transfer:
sent 38,126,148,150 bytes received 202 bytes 106,945,717.68 bytes/sec
And the Btrfs RAID-5 transfer:
sent 38,126,148,151 bytes received 202 bytes 102,078,041.11 bytes/sec
During the transfer top showed dm-integrity working, with "md127_raid5" occasionally spiking to 100% CPU usage:
%CPU %MEM     TIME+ COMMAND
31.2  0.0   1:42.90 md127_raid5
 9.0  0.3   0:19.28 smbd
 2.0  0.0   0:00.19 kworker/u8:1+dm-integrity-wait
 1.3  0.0   0:00.89 kworker/1:22+dm-integrity-writer
 1.0  0.0   0:00.58 kworker/1:28+dm-integrity-writer
 0.7  0.0   0:04.09 kworker/u8:0+flush-9:127
 0.7  0.0   0:02.51 kworker/u8:2-dm-integrity-wait
 0.7  0.0   0:00.59 kworker/1:17-dm-integrity-metadata
 0.3  0.0   0:05.31 kworker/0:1H-kblockd
 0.3  0.0   0:00.19 kworker/1:30-dm-integrity-metadata
 0.3  0.0   0:00.20 kworker/1:31+dm-integrity-commit
mdadm - Power outage
I have again added the file zoo.mkv to the files on the client and will begin the rsync transfer, then pull the power cord at about 50%.
$ rsync -a --progress --stats tmp/ mnt/testbox/pub/tmp/
sending incremental file list
zoo.mkv
  5,887,590,400  38%   91.19kB/s   20:22:4  ^C
The power cord has been pulled. The transfer aborted on the client and the mdadm machine has been powered back up. Because I'm using dm-integrity I have to remember to open up the devices:
# integritysetup open --integrity sha256 /dev/sdb sdb
# integritysetup open --integrity sha256 /dev/sdc sdc
# integritysetup open --integrity sha256 /dev/sdd sdd
Now I can take a look at the state of the system:
$ dmesg
[   82.332815] md/raid:md127: not clean -- starting background reconstruction
[   82.332838] md/raid:md127: device dm-2 operational as raid disk 2
[   82.332839] md/raid:md127: device dm-1 operational as raid disk 1
[   82.332840] md/raid:md127: device dm-0 operational as raid disk 0
[   82.333329] md/raid:md127: raid level 5 active with 3 out of 3 devices, algorithm 2
It's working. The filesystem was shut down in an unclean fashion and top shows a kworker process busy cleaning up:
$ top -n 1
%CPU %MEM     TIME+ COMMAND
 6.2  0.0   0:00.58 kworker/1:1H-kblockd
 0.0  0.1   0:00.53 systemd
 0.0  0.0   0:00.00 kthreadd
 0.0  0.0   0:00.00 rcu_gp
 0.0  0.0   0:00.00 rcu_par_gp
 0.0  0.0   0:00.08 kworker/0:0-dm-integrity-metadata
 0.0  0.0   0:00.00 kworker/0:0H-kblockd
mdstat however doesn't show anything useful:
$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 dm-2[2] dm-1[1] dm-0[0]
      309764096 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
      bitmap: 0/2 pages [0KB], 65536KB chunk

unused devices: <none>
Rather, with mdadm we need to pay close attention to dmesg or the log:
# dmesg
[  190.074041] EXT4-fs (md127): recovery completed
So now I can mount the RAID-5 and take a look at the directory:
# mount /dev/md/raid5 /pub
$ ls -gG /pub
-rwxrw-r-- 1    18576345 Apr 21 09:08 1.pdf
-rwxrw-r-- 1    30255102 Apr 21 09:08 2.pdf
-rwxrw-r-- 1    22016195 Apr 21 09:08 3.pdf
-rwxrw-r-- 1 35456180485 Apr 21 07:58 bar.mkv
-rwxrw-r-- 1   625338368 Mar  5  2018 boo.iso
-rwxrw-r-- 1  1548841922 Apr 15 23:50 foo.mkv
-rwxrw-r-- 1   415633408 Mar  5  2018 moo.iso
-rwxrw-r-- 1           0 Apr 27 19:51 .zoo.mkv.pH62pr
Traces of the broken rsync transfer remain in the directory.
mdadm - Drive failure
Time to simulate a drive failure during the transfer of the zoo.mkv file. I will again remove a drive from the pool and see how dm-integrity+mdadm handles that.
$ dmesg
[  865.320920] ata4: SATA link down (SStatus 0 SControl 300)
[  865.320926] ata4.00: disabled
[  865.320942] sd 3:0:0:0: rejecting I/O to offline device
[  865.320945] print_req_error: I/O error, dev sdb, sector 712 flags 801
[  865.320951] print_req_error: I/O error, dev sdb, sector 712 flags 801
[  865.320956] device-mapper: integrity: Error on writing journal: -5
[  865.320966] sd 3:0:0:0: rejecting I/O to offline device
[  865.320968] print_req_error: I/O error, dev sdb, sector 131455112 flags 0
[  865.320979] sd 3:0:0:0: rejecting I/O to offline device
[  865.320981] print_req_error: I/O error, dev sdb, sector 131455120 flags 0
[  865.320982] md: super_written gets error=10
[  865.320986] md/raid:md127: Disk failure on dm-0, disabling device.
               md/raid:md127: Operation continuing on 2 devices.
[  865.321003] md/raid:md127: read error not correctable (sector 130306048 on dm-0).
[  865.321012] md/raid:md127: read error not correctable (sector 130306056 on dm-0).
[  865.321019] md/raid:md127: read error not correctable (sector 130306064 on dm-0).
[  865.321026] md/raid:md127: read error not correctable (sector 130306072 on dm-0).
[  865.321033] md/raid:md127: read error not correctable (sector 130306080 on dm-0).
[  865.321040] md/raid:md127: read error not correctable (sector 130306088 on dm-0).
[  865.321047] md/raid:md127: read error not correctable (sector 130306096 on dm-0).
[  865.321054] md/raid:md127: read error not correctable (sector 130306104 on dm-0).
[  865.321061] md/raid:md127: read error not correctable (sector 130306112 on dm-0).
[  865.321063] ata4.00: detaching (SCSI 3:0:0:0)
[  865.321069] md/raid:md127: read error not correctable (sector 130306120 on dm-0).
[  865.323206] sd 3:0:0:0: [sdb] Synchronizing SCSI cache
[  865.323245] sd 3:0:0:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  865.323247] sd 3:0:0:0: [sdb] Stopping disk
[  865.323258] sd 3:0:0:0: [sdb] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
dmesg shows unrecoverable read errors.
Let's look at mdadm:
# mdadm --misc -D /dev/md/raid5
/dev/md/raid5:
           Version : 1.2
     Creation Time : Sat Apr 27 19:15:28 2019
        Raid Level : raid5
        Array Size : 309764096 (295.41 GiB 317.20 GB)
     Used Dev Size : 154882048 (147.71 GiB 158.60 GB)
      Raid Devices : 3
     Total Devices : 3
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Sat Apr 27 20:07:24 2019
             State : active, degraded
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 1
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : testbox:raid5  (local to host testbox)
              UUID : c6183471:4e732124:11d110d8:1acd5593
            Events : 135

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1     254        1        1      active sync   /dev/dm-1
       2     254        2        2      active sync   /dev/dm-2

       0     254        0        -      faulty   /dev/dm-0
mdstat also shows the drive as missing (there should be three U's):
$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 dm-2[2] dm-1[1] dm-0[0](F)
      309764096 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [_UU]
      bitmap: 1/2 pages [4KB], 65536KB chunk

unused devices: <none>
Now I need to figure out which device dm-0 is, just so I can keep track (this is why you should always use device names with serial numbers, so you can keep physical track of the drives too):
$ ls -l /dev/disk/by-id/
dm-name-sdb -> ../../dm-0
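This mapping step comes up every time a member fails, so it can be wrapped in a small helper. The sketch below just resolves every symlink in a directory; pointed at /dev/disk/by-id it shows which dm-N device each dm-name-* entry maps to (the function name is mine, not a standard tool):

```shell
# Sketch: print every symlink in a directory together with its resolved
# target. On a real system call it as: list_links /dev/disk/by-id
list_links() {
    for link in "$1"/*; do
        if [ -L "$link" ]; then
            printf '%s -> %s\n' "${link##*/}" "$(readlink -f "$link")"
        fi
    done
}

if [ -d /dev/disk/by-id ]; then
    list_links /dev/disk/by-id
fi
```

Combined with the serial-number links (ata-...) in the same directory, this ties each dm device back to a physical disk.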
I haven't got any room for a spare device so I'll mark the device as failed and remove it:
# mdadm --manage /dev/md/raid5 --fail /dev/dm-0
mdadm: set /dev/dm-0 faulty in /dev/md/raid5

# mdadm --manage /dev/md/raid5 --remove /dev/dm-0
mdadm: hot removed /dev/dm-0 from /dev/md/raid5
# mdadm --misc -D /dev/md/raid5
...
    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1     254        1        1      active sync   /dev/dm-1
       2     254        2        2      active sync   /dev/dm-2
Then I'll shut down the machine, attach a new disk, format it, and bring it into the pool as a replacement.
Looking at "by-id" I can verify that the new drive I have attached hasn't messed up the device mapping.
# ls -l /dev/disk/by-id/
ata-ST9160821AS_5MA7BFKV -> ../../sdb
ata-ST9160821AS_5MA7BFKV-part1 -> ../../sdb1
The new drive already contains a filesystem so I need to format and wipe it:
# integritysetup format --integrity sha256 /dev/sdb

WARNING!
========
This will overwrite data on /dev/sdb irrevocably.

Are you sure? (Type uppercase yes): YES
WARNING: Device /dev/sdb already contains a 'dos' partition signature.
Formatted with tag size 4, internal integrity sha256.
Wiping device to initialize integrity checksum.
You can interrupt this by pressing CTRL+c (rest of not wiped device will contain invalid checksum).
Progress:   0.3%, ETA 65:30, 509 MiB written, speed  38.4 MiB/s
Then once it's done I open the other drives:
# integritysetup open --integrity sha256 /dev/sdc sdc
# integritysetup open --integrity sha256 /dev/sdd sdd
With only the two working drives opened, mdadm shows that dm-0 is missing:
# mdadm --misc -D /dev/md/raid5
...
    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1     254        1        1      active sync   /dev/dm-1
       2     254        2        2      active sync   /dev/dm-2
Then I open sdb and attach it to the RAID:
# integritysetup open --integrity sha256 /dev/sdb sdb
# mdadm --add /dev/md/raid5 /dev/mapper/sdb
mdadm: added /dev/mapper/sdb
Now mdadm brings the RAID-5 pool into sync:
$ cat /proc/mdstat
md127 : active raid5 dm-0[3] dm-2[2] dm-1[1]
      309764096 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [_UU]
      [>....................]  recovery =  0.3% (498968/154882048) finish=165.0min speed=15592K/sec
      bitmap: 1/2 pages [4KB], 65536KB chunk

unused devices: <none>
ZFS implements a very sophisticated block tracking mechanism, so it knows exactly which blocks it needs to reconstruct. This means that ZFS only reconstructs the used blocks and is extremely fast at it, especially if the disks don't contain much data. Btrfs is slower than ZFS, but it also only reconstructs the used blocks.
mdadm, on the other hand, reconstructs every single block on the disk, including the unused ones, which makes the reconstruction process extremely slow even on small disks.
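The rebuild rate is also capped by two kernel sysctls, in KiB/s per device. Raising them can shorten a resync at the cost of competing I/O; the values below are examples, not recommendations:

```shell
# Sketch: inspect and raise the md rebuild speed bounds (KiB/s per device).
cat /proc/sys/dev/raid/speed_limit_min    # default: 1000
cat /proc/sys/dev/raid/speed_limit_max    # default: 200000

echo 50000 > /proc/sys/dev/raid/speed_limit_min
```

Even with the limits raised, mdadm still has to touch every block, so the reconstruction remains fundamentally slower than ZFS or Btrfs on a mostly empty pool.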
Eventually mdadm is done:
$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 dm-0[3] dm-2[2] dm-1[1]
      309764096 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
      bitmap: 0/2 pages [0KB], 65536KB chunk

unused devices: <none>
And dmesg confirms this:
[17741.968738] md: md127: recovery done.
This took a very long time.
mdadm - Drive failure during file transfer
Now that the RAID-5 pool has been restored I'm going to remove a drive during a file transfer and then put it back in, and see how mdadm+dm-integrity behaves.
During the transfer the log shows that the drive is gone:
kernel: print_req_error: I/O error, dev sdd, sector 203305096 flags 0
kernel: device-mapper: integrity: Error on reading tags: -5
kernel: sd 6:0:0:0: rejecting I/O to offline device
kernel: print_req_error: I/O error, dev sdd, sector 46430728 flags 0
kernel: sd 6:0:0:0: [sdd] tag#2 CDB: Read(10) 28 00 02 c4 7a 08 00 00 80 00
kernel: sd 6:0:0:0: [sdd] tag#2 Add. Sense: Unaligned write command
kernel: sd 6:0:0:0: [sdd] tag#2 Sense Key : Illegal Request [current]
kernel: sd 6:0:0:0: [sdd] tag#2 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
kernel: ata7.00: disabled
kernel: ata7: SATA link down (SStatus 0 SControl 300)
mdadm reacted by halting the file transfer for about a second, then resumed it without the client noticing anything other than a momentary drop in the transfer speed.
$ rsync -a --progress --stats tmp/ mnt/testbox/pub/tmp/
sending incremental file list
zoo.mkv
 10,867,033,488 100%   35.52MB/s    0:04:51 (xfr#3, to-chk=0/9)

Number of files: 9 (reg: 8, dir: 1)
Number of created files: 1 (reg: 1)
Number of deleted files: 0
Number of regular files transferred: 3
Total file size: 48,983,875,313 bytes
Total transferred file size: 10,919,304,785 bytes
Literal data: 10,919,304,785 bytes
Matched data: 0 bytes
File list size: 0
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 10,921,970,962
Total bytes received: 76

sent 10,921,970,962 bytes  received 76 bytes  27,969,196.00 bytes/sec
total size is 48,983,875,313  speedup is 4.48
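Because the failure was completely invisible to the client, a degraded array is easy to miss. mdadm's monitor mode can send mail on array events such as a failed device (it will not catch silent corruption, which still needs scrubs); the address below is an example:

```shell
# Sketch: run mdadm in monitor mode so device failures generate mail
# instead of sitting silently in /proc/mdstat. Many distributions ship
# this as an "mdmonitor" service driven by MAILADDR in /etc/mdadm.conf.
mdadm --monitor --scan --daemonise --mail=root@localhost
```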
In the ZFS and Btrfs tests I always shut down the machine before re-attaching the removed drive, but this time I have just re-attached the drive. mdadm shows the drive as faulty:
$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 dm-0[3] dm-2[2](F) dm-1[1]
      309764096 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]
      bitmap: 1/2 pages [4KB], 65536KB chunk

unused devices: <none>
# mdadm --misc -D /dev/md/raid5
...
    Number   Major   Minor   RaidDevice State
       3     254        0        0      active sync   /dev/dm-0
       1     254        1        1      active sync   /dev/dm-1
       -       0        0        2      removed

       2     254        2        -      faulty   /dev/dm-2
So I'll have to remove the device from mdadm, close the device, then open it again and re-attach it:
# mdadm --manage /dev/md/raid5 --remove /dev/dm-2
mdadm: hot removed /dev/dm-2 from /dev/md/raid5

# mdadm --misc -D /dev/md/raid5
...
    Number   Major   Minor   RaidDevice State
       3     254        0        0      active sync   /dev/dm-0
       1     254        1        1      active sync   /dev/dm-1
       -       0        0        2      removed

$ ls -l /dev/disk/by-id/
dm-name-sdd -> ../../dm-2

# integritysetup close sdd
# integritysetup open --integrity sha256 /dev/sdd sdd
# mdadm --manage /dev/md/raid5 --re-add /dev/dm-2
mdadm: re-added /dev/dm-2
mdadm automatically begins the recovery process:
$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 dm-2[2] dm-0[3] dm-1[1]
      309764096 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]
      [==>..................]  recovery = 14.9% (23132796/154882048) finish=184.9min speed=11872K/sec
      bitmap: 1/2 pages [4KB], 65536KB chunk

unused devices: <none>
# mdadm --misc -D /dev/md/raid5
...
    Number   Major   Minor   RaidDevice State
       3     254        0        0      active sync   /dev/dm-0
       1     254        1        1      active sync   /dev/dm-1
       2     254        2        2      spare rebuilding   /dev/dm-2
As with ZFS and Btrfs any mdadm pool should be "scrubbed" at regular intervals.
On mdadm this basically involves reading the entire pool, such that any problems with the drive will trigger a read error and auto-correction, and any problems with the data will be picked up. It's controlled by writing to the "sync_action" parameter in /sys:
# echo check > /sys/block/md127/md/sync_action
# dmesg
[20520.024219] md: data-check of RAID pool md127
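For regular scrubbing this can be wrapped in a small cron-friendly script. The sketch below reads the md sysfs files directly; the function names are mine, and taking the sysfs directory as a parameter (on a real system, /sys/block/md127/md) keeps it easy to dry-run:

```shell
# Sketch: report the state of an md scrub from sysfs.
md_scrub_status() {
    printf 'sync_action=%s mismatch_cnt=%s\n' \
        "$(cat "$1/sync_action")" "$(cat "$1/mismatch_cnt")"
}

# Start a check and poll until the array goes idle again, then report
# the kernel's mismatch counter.
md_scrub() {
    echo check > "$1/sync_action"
    while [ "$(cat "$1/sync_action")" != "idle" ]; do
        sleep 60
    done
    md_scrub_status "$1"
}
```

A non-zero mismatch_cnt after a check is the signal that the parity and data disagreed somewhere.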
Notice the difference between the earlier mdstat "recovery" message and the "check" message:
# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 dm-2[2] dm-0[3] dm-1[1]
      309764096 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
      [===>.................]  check = 17.1% (26578480/154882048) finish=121.1min speed=17642K/sec
      bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
Again, compared to both ZFS and Btrfs this took a very long time.
mdadm - Data corruption during file transfer
Time to simulate disk corruption in the middle of a file transfer from the client. I have removed the "zoo.mkv" file, and while the rsync command is running I will use the dd command multiple times on the mdadm machine against one of the drives:
# dd if=/dev/urandom of=/dev/mapper/sdb seek=3000 count=30 bs=1k
UPDATE 2019-08-27: This is where I made the mistake. Writing to /dev/mapper/sdb also updates the dm-integrity checksum. The correct test would have been to write random data to /dev/sdb.
At first I misunderstood the behavior of dm-integrity and I was expecting to see something like this in the log:
device-mapper: INTEGRITY AEAD ERROR, sector 39784
But I didn't.
According to the dm-integrity documentation I should be seeing an integrity failure count, but that would of course require that I was hitting a sector that was being read, which wasn't the case in my example.
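That counter can be read without waiting for a scrub. Per the kernel's dm-integrity documentation, the first target-specific field of the "dmsetup status" line is the number of failed checksum verifications; the exact field position below is my reading of that documentation, so verify it against your kernel before relying on it:

```shell
# Sketch: extract the dm-integrity mismatch counter from a dmsetup
# status line ("start length integrity <mismatches> ..."), i.e. the
# 4th whitespace-separated field.
integrity_mismatches() {
    echo "$1" | awk '{print $4}'
}

# On a real system, loop over the opened devices:
if command -v dmsetup >/dev/null 2>&1; then
    for dev in sdb sdc sdd; do
        printf '%s: %s mismatches\n' "$dev" \
            "$(integrity_mismatches "$(dmsetup status "$dev" 2>/dev/null)")"
    done
fi
```

The counter only moves when a bad sector is actually read, which is exactly why my dd writes went unnoticed until something touched those sectors.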
So I need to run a scrub:
# echo check > /sys/block/md127/md/sync_action
Mismatching checksums are now found:
# dmesg --follow
[  980.626850] md: data-check of RAID pool md127
[  982.490908] md127: mismatch sector in range 38120-38128
[  982.490911] md127: mismatch sector in range 38128-38136
[  982.490913] md127: mismatch sector in range 38144-38152
[  982.490914] md127: mismatch sector in range 38152-38160
[  982.490916] md127: mismatch sector in range 38160-38168
[  982.490917] md127: mismatch sector in range 38168-38176
[  982.490918] md127: mismatch sector in range 38048-38056
[  982.490919] md127: mismatch sector in range 38056-38064
[  982.490922] md127: mismatch sector in range 38064-38072
[  982.490923] md127: mismatch sector in range 38072-38080
# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 dm-2[1] dm-1[2] dm-0[3]
      309764096 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
      [=>...................]  check =  7.0% (10876736/154882048) finish=139.0min speed=17257K/sec
      bitmap: 1/2 pages [4KB], 65536KB chunk

unused devices: <none>
And after a very long time mdadm has fixed the pool:
# dmesg
[13943.389533] md: md127: data-check done.
# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 dm-2[1] dm-1[2] dm-0[3]
      309764096 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
      bitmap: 0/2 pages [0KB], 65536KB chunk

unused devices: <none>
# mdadm --misc -D /dev/md/raid5
/dev/md/raid5:
           Version : 1.2
     Creation Time : Sat Apr 27 19:15:28 2019
        Raid Level : raid5
        Array Size : 309764096 (295.41 GiB 317.20 GB)
     Used Dev Size : 154882048 (147.71 GiB 158.60 GB)
      Raid Devices : 3
     Total Devices : 3
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Tue Apr 30 03:42:55 2019
             State : clean
    Active Devices : 3
   Working Devices : 3
    Failed Devices : 0
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : testbox:raid5  (local to host testbox)
              UUID : c6183471:4e732124:11d110d8:1acd5593
            Events : 4454

    Number   Major   Minor   RaidDevice State
       3     254        0        0      active sync   /dev/dm-0
       1     254        2        1      active sync   /dev/dm-2
       2     254        1        2      active sync   /dev/dm-1
I can mount the pool on the client and access the files:
$ ls -gG tmp
total 47835856
-rwxr-xr-x 1    18576345 Apr 21 09:08 1.pdf
-rwxr-xr-x 1    30255102 Apr 21 09:08 2.pdf
-rwxr-xr-x 1    22016195 Apr 21 09:08 3.pdf
-rwxr-xr-x 1 35456180485 Apr 21 07:58 bar.mkv
-rwxr-xr-x 1   625338368 Mar  5  2018 boo.iso
-rwxr-xr-x 1  1548841922 Apr 15 23:50 foo.mkv
-rwxr-xr-x 1   415633408 Mar  5  2018 moo.iso
-rwxr-xr-x 1 10867033488 Apr 22 21:10 zoo.mkv
mdadm - The dd mistake
Time to see what happens if the sysadmin issues the dd command on one of the RAID drives by mistake. Again I have removed the "zoo.mkv" file, and I am doing this during an active file transfer:
# dd if=/dev/urandom of=/dev/mapper/sdb bs=1M
^C89745+0 records in
89745+0 records out
91898880 bytes (92 MB, 88 MiB) copied, 26.8636 s, 3.4 MB/s
UPDATE 2019-08-27: This is where I made the mistake again. Writing to /dev/mapper/sdb also updates the dm-integrity checksum. The correct test would have been to write random data to /dev/sdb.
Not as much data overwritten as in the tests with ZFS or Btrfs, but this should still suffice.
Nothing noticeable happened on the client. The file transfer keeps going:
$ rsync -a --progress --stats /home/naim/tmp/ /home/naim/
sending incremental file list
./
zoo.mkv
 10,223,976,448  94%   31.22MB/s    0:00:20
At this point in the tests ZFS had already detected checksum errors and reported "One or more devices has experienced an unrecoverable error". With mdadm there isn't really anything to go by until you run a scrub:
# echo check > /sys/block/md127/md/sync_action
# dmesg --follow
[18709.603919] md: data-check of RAID pool md127
[18710.296237] md127: mismatch sector in range 152-160
[18710.296240] md127: mismatch sector in range 144-152
[18710.296242] md127: mismatch sector in range 136-144
[18710.296243] md127: mismatch sector in range 128-136
[18710.296245] md127: mismatch sector in range 120-128
[18710.296248] md127: mismatch sector in range 112-120
[18710.296249] md127: mismatch sector in range 104-112
[18710.296251] md127: mismatch sector in range 96-104
[18710.296261] md127: mismatch sector in range 88-96
[18710.296263] md127: mismatch sector in range 80-88
[18715.299379] handle_parity_checks5: 25825 callbacks suppressed
[18715.299381] md127: mismatch sector in range 195112-195120
[18715.299652] md127: mismatch sector in range 195096-195104
[18715.299836] md127: mismatch sector in range 195176-195184
[18715.299890] md127: mismatch sector in range 195168-195176
[18715.299945] md127: mismatch sector in range 195152-195160
[18715.300001] md127: mismatch sector in range 195144-195152
[18715.300048] md127: mismatch sector in range 195192-195200
[18715.300106] md127: mismatch sector in range 195160-195168
[18715.300143] md127: mismatch sector in range 195136-195144
[18715.300180] md127: mismatch sector in range 195184-195192
[18720.302862] handle_parity_checks5: 24559 callbacks suppressed
...
# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 dm-2[1] dm-1[2] dm-0[3]
      309764096 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
      [>....................]  check =  3.0% (4786740/154882048) finish=125.7min speed=19891K/sec
      bitmap: 0/2 pages [0KB], 65536KB chunk

unused devices: <none>
After the scrub is finally done:
# dmesg
[31789.569052] md: md127: data-check done
# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 dm-2[1] dm-1[2] dm-0[3]
      309764096 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
      bitmap: 0/2 pages [0KB], 65536KB chunk

unused devices: <none>
# mdadm --misc -D /dev/md/raid5
/dev/md/raid5:
           Version : 1.2
     Creation Time : Sat Apr 27 19:15:28 2019
        Raid Level : raid5
        Array Size : 309764096 (295.41 GiB 317.20 GB)
     Used Dev Size : 154882048 (147.71 GiB 158.60 GB)
      Raid Devices : 3
     Total Devices : 3
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Mon Apr 29 00:41:18 2019
             State : clean
    Active Devices : 3
   Working Devices : 3
    Failed Devices : 0
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : testbox:raid5  (local to host testbox)
              UUID : c6183471:4e732124:11d110d8:1acd5593
            Events : 3457

    Number   Major   Minor   RaidDevice State
       3     254        0        0      active sync   /dev/dm-0
       1     254        2        1      active sync   /dev/dm-2
       2     254        1        2      active sync   /dev/dm-1
I did this test multiple times on the mdadm system. The second time everything was fixed at this point: the pool continued to run and I could mount and use the filesystem. But the first time things didn't go as smoothly! I got mount errors:
# mount /dev/md/raid5 /pub/
mount: /pub: wrong fs type, bad option, bad superblock on /dev/md127, missing codepage or helper program, or other error.
I tried rebooting the machine, re-opening the drives, and then assembling, but the problem persisted.
# mdadm --stop /dev/md/raid5
mdadm: stopped /dev/md/raid5
# mdadm --assemble --scan
mdadm: /dev/md/raid5 has been started with 3 drives.
# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 dm-0[3] dm-1[2] dm-2[1]
      309764096 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
      bitmap: 0/2 pages [0KB], 65536KB chunk

unused devices: <none>
# mount /dev/md/raid5 /pub/
mount: /pub: wrong fs type, bad option, bad superblock on /dev/md127, missing codepage or helper program, or other error.
The drives are all registered as clean:
# mdadm --examine /dev/mapper/sdb /dev/mapper/sdc /dev/mapper/sdd
/dev/mapper/sdb:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : c6183471:4e732124:11d110d8:1acd5593
           Name : testbox:raid5  (local to host testbox)
  Creation Time : Sat Apr 27 19:15:28 2019
     Raid Level : raid5
   Raid Devices : 3

 Avail Dev Size : 309764392 (147.71 GiB 158.60 GB)
     Array Size : 309764096 (295.41 GiB 317.20 GB)
  Used Dev Size : 309764096 (147.71 GiB 158.60 GB)
    Data Offset : 264192 sectors
   Super Offset : 8 sectors
   Unused Space : before=264112 sectors, after=296 sectors
          State : clean
    Device UUID : 84c226cb:7d10c97d:8820a12f:24e28510

Internal Bitmap : 8 sectors from superblock
    Update Time : Mon Apr 29 00:41:18 2019
  Bad Block Log : 512 entries available at offset 16 sectors
       Checksum : e4546d90 - correct
         Events : 3457

         Layout : left-symmetric
     Chunk Size : 512K

    Device Role : Active device 0
    Array State : AAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/mapper/sdc:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : c6183471:4e732124:11d110d8:1acd5593
           Name : testbox:raid5  (local to host testbox)
  Creation Time : Sat Apr 27 19:15:28 2019
     Raid Level : raid5
   Raid Devices : 3

 Avail Dev Size : 309764392 (147.71 GiB 158.60 GB)
     Array Size : 309764096 (295.41 GiB 317.20 GB)
  Used Dev Size : 309764096 (147.71 GiB 158.60 GB)
    Data Offset : 264192 sectors
   Super Offset : 8 sectors
   Unused Space : before=264112 sectors, after=296 sectors
          State : clean
    Device UUID : 924b5275:8d6bee5c:1d049510:60f6103f

Internal Bitmap : 8 sectors from superblock
    Update Time : Mon Apr 29 00:41:18 2019
  Bad Block Log : 512 entries available at offset 16 sectors
       Checksum : 7d24497d - correct
         Events : 3457

         Layout : left-symmetric
     Chunk Size : 512K

    Device Role : Active device 1
    Array State : AAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/mapper/sdd:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : c6183471:4e732124:11d110d8:1acd5593
           Name : testbox:raid5  (local to host testbox)
  Creation Time : Sat Apr 27 19:15:28 2019
     Raid Level : raid5
   Raid Devices : 3

 Avail Dev Size : 309764392 (147.71 GiB 158.60 GB)
     Array Size : 309764096 (295.41 GiB 317.20 GB)
  Used Dev Size : 309764096 (147.71 GiB 158.60 GB)
    Data Offset : 264192 sectors
   Super Offset : 8 sectors
   Unused Space : before=264112 sectors, after=296 sectors
          State : clean
    Device UUID : 121eecb0:459a731c:9a45a4e9:b42b46d7

Internal Bitmap : 8 sectors from superblock
    Update Time : Mon Apr 29 00:41:18 2019
  Bad Block Log : 512 entries available at offset 16 sectors
       Checksum : e987c188 - correct
         Events : 3457

         Layout : left-symmetric
     Chunk Size : 512K

    Device Role : Active device 2
    Array State : AAA ('A' == active, '.' == missing, 'R' == replacing)
This is clearly a filesystem error and it seems like I might be dealing with The Bad Block Controversy.
From the documentation I get the following options:
The first thing is to try and check the integrity of your file system. A command like "tar cf / > /dev/null" will read the entire file system and tell you if any files are unreadable. It should also clear any Bad Blocks that have data on them but are recoverable. However, this is a known bug - that doesn't always happen.
But the Bad Blocks may be on an unallocated portion of the file system. If you wish to clear that, try a command like "cat /dev/zero > /tempfile &; rm /tempfile". This will fill all your spare disk space with zeroes, then delete the file it used to do so.
After both these things have been done, your Bad Blocks list should be empty. However, both these commands are very disk-heavy, and will take a very long time on a modern pool. Plus the code is strongly suspected to be buggy so these commands could very likely not work.
If you are satisfied that everything is okay, and you don't want the Bad Blocks functionality, the easy way to get rid of it (if you have no Bad Blocks list to clear) is "mdadm ... --assemble --update=no-bbl".
If, however, you do have an active Bad Blocks list with sectors in it, this command won't work. You can use the command "mdadm ... --assemble --update=force-no-bbl" to delete the list, but this will now mean that mdadm will probably return garbage where before it failed with an error. If you're satisfied that your file system is intact, though, this won't matter to you.
In my specific case none of the above are going to work and the only way forward is to actually clean the ext4 filesystem and hope for the best:
# fsck /dev/md/raid5
...
Inode 523773 seems to contain garbage.  Clear? yes
Inode 523774 seems to contain garbage.  Clear? yes
Inode 523775 seems to contain garbage.  Clear? yes
...
# top
%CPU %MEM     TIME+ COMMAND
23.3  0.1   0:18.20 fsck.ext4
 3.3  0.0   0:00.58 kworker/0:28-dm-integrity-metadata
 3.0  0.0   0:01.51 kworker/1:5-dm-integrity-metadata
 3.0  0.0   0:00.66 kworker/0:25-dm-integrity-metadata
 2.7  0.0   0:01.34 kworker/0:6-dm-integrity-metadata
 2.7  0.0   0:00.80 kworker/1:19-dm-integrity-metadata
 2.7  0.0   0:00.75 kworker/1:34+dm-integrity-metadata
...

/dev/md127: ***** FILE SYSTEM WAS MODIFIED *****
/dev/md127: 21/19365888 files (14.3% non-contiguous), 13453202/77441024 blocks
Tons and tons of errors. I will now mount the pool and check the files:
# mount /dev/md/raid5 /pub
# cd /pub
# ls -a
.  ..  lost+found
The tmp directory is gone!
# cd lost+found
# du -h
4.0K    ./#11
46G     ./#12058625
# cd \#12058625/
# ls -la
total 47835856
-rwxrw-r-- 1    18576345 Apr 21 09:08 1.pdf
-rwxrw-r-- 1    30255102 Apr 21 09:08 2.pdf
-rwxrw-r-- 1    22016195 Apr 21 09:08 3.pdf
-rwxrw-r-- 1 35456180485 Apr 21 07:58 bar.mkv
-rwxrw-r-- 1   625338368 Mar  5  2018 boo.iso
-rwxrw-r-- 1  1548841922 Apr 15 23:50 foo.mkv
-rwxrw-r-- 1   415633408 Mar  5  2018 moo.iso
-rwxrw-r-- 1 10867033488 Apr 22 21:10 zoo.mkv
However, the files are still there.
I decided to stop testing mdadm+dm-integrity here, as the last test, the failure of a second drive during restoration, would be rather pointless and very time consuming.
A correct test of mdadm+dm-integrity
As mentioned in the beginning, when I tested for data integrity errors I wrote to /dev/mapper/sdb, which also updates the dm-integrity checksum. Later, when I ran the sync_action check, the errors displayed weren't dm-integrity checksum errors but rather RAID-5 parity errors. The correct test would have been to write random data to /dev/sdb.
The test below was sent to me by Philip, who was kind enough to write to me about the mistake I made and to provide an example of a test where the random data is written to the correct device.
# integritysetup format --integrity sha256 /dev/nvme0n1
# integritysetup format --integrity sha256 /dev/nvme0n2
# integritysetup format --integrity sha256 /dev/nvme0n3
# integritysetup open --integrity sha256 /dev/nvme0n1 nvme0n1
# integritysetup open --integrity sha256 /dev/nvme0n2 nvme0n2
# integritysetup open --integrity sha256 /dev/nvme0n3 nvme0n3
# mdadm --create --verbose --level=5 --raid-devices=3 /dev/md0 /dev/mapper/nvme0n1 /dev/mapper/nvme0n2 /dev/mapper/nvme0n3
# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 dm-6[3] dm-5[1] dm-4[0]
      8253440 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]

unused devices: <none>
# mkfs.xfs /dev/md0
# mount /dev/md0 /mnt/md0
# dd if=/dev/zero of=/mnt/md0/test.data bs=1G count=7
7+0 records in
7+0 records out
7516192768 bytes (7.5 GB, 7.0 GiB) copied, 156.89 s, 47.9 MB/s
# sha256sum test.data
5e002ad6567bbdc9d43cc140cc509a592838457a56df571cd203230c7a56f241  test.data
# dd if=/dev/urandom of=/dev/nvme0n1 seek=3000 bs=1k
^C641763+0 records in
641762+0 records out
657164288 bytes (657 MB, 627 MiB) copied, 20.8391 s, 31.5 MB/s
# sha256sum test.data
5e002ad6567bbdc9d43cc140cc509a592838457a56df571cd203230c7a56f241  test.data
...
Aug 26 23:03:16 localhost.localdomain kernel: device-mapper: integrity: Checksum failed at sector 0x12faf0
Aug 26 23:03:16 localhost.localdomain kernel: device-mapper: integrity: Checksum failed at sector 0x12faf8
Aug 26 23:03:16 localhost.localdomain kernel: device-mapper: integrity: Checksum failed at sector 0x12fb00
Aug 26 23:03:16 localhost.localdomain kernel: device-mapper: integrity: Checksum failed at sector 0x12fb08
Aug 26 23:03:16 localhost.localdomain kernel: device-mapper: integrity: Checksum failed at sector 0x12fb10
Aug 26 23:03:16 localhost.localdomain kernel: device-mapper: integrity: Checksum failed at sector 0x12fb18
Aug 26 23:03:16 localhost.localdomain kernel: device-mapper: integrity: Checksum failed at sector 0x12fb20
Aug 26 23:03:16 localhost.localdomain kernel: device-mapper: integrity: Checksum failed at sector 0x12fb28
Aug 26 23:03:16 localhost.localdomain kernel: device-mapper: integrity: Checksum failed at sector 0x12fb30
Aug 26 23:03:16 localhost.localdomain kernel: device-mapper: integrity: Checksum failed at sector 0x12fb38
Aug 26 23:03:16 localhost.localdomain kernel: device-mapper: integrity: Checksum failed at sector 0x12fb40
Aug 26 23:03:16 localhost.localdomain kernel: device-mapper: integrity: Checksum failed at sector 0x12fb48
...
A second run of the checksum verification produces no dm-integrity error output:

# sha256sum test.data
5e002ad6567bbdc9d43cc140cc509a592838457a56df571cd203230c7a56f241  test.data
Thank you very much for sharing this, Philip!
Final notes
With mdadm you can set up any kind of RAID pool for disk redundancy, and you can even expand the pool with further disks if needed. You can then add dm-integrity for error detection and error correction at the block level. If you add dm-crypt+LUKS to that, the data integrity is protected with native authenticated encryption. And if you use the RAID pool as an LVM physical volume, you can do snapshots too. But it cannot by a long shot compare to ZFS or Btrfs!
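For reference, a minimal sketch of what building that whole stack looks like. The device names are placeholders, the sizes are made up, and this is only meant to show how many separate layers you have to assemble and understand, compared to a single zpool create:

```shell
# 1. Per-disk dm-integrity metadata, for silent-corruption detection:
integritysetup format /dev/sdb
integritysetup open /dev/sdb int-sdb
# ...repeat for /dev/sdc and /dev/sdd...

# 2. RAID-5 across the integrity devices, for redundancy:
mdadm --create /dev/md0 --level=5 --raid-devices=3 \
    /dev/mapper/int-sdb /dev/mapper/int-sdc /dev/mapper/int-sdd

# 3. Optional LUKS2 on top; with --integrity hmac-sha256 the
# encryption is authenticated rather than just confidential:
cryptsetup luksFormat --type luks2 /dev/md0
cryptsetup open /dev/md0 cryptpool

# 4. LVM on top of that, for snapshots:
pvcreate /dev/mapper/cryptpool
vgcreate pool /dev/mapper/cryptpool
lvcreate -n data -L 100G pool
lvcreate -s -n data-snap -L 10G pool/data
```

Every one of these layers has its own failure modes, its own documentation, and its own recovery procedures, which is exactly the point made above.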
And I would never use something like mdadm+dm-integrity as a replacement for ZFS or Btrfs.
It is true that you don't have to worry about compiling a kernel module every time the kernel is updated (as with ZFS on Linux), about using annoying Solaris compatibility layers, or about running with an external, unofficial and often out-of-sync repository (Arch Linux developers, yes I'm talking to you! We need ZFS in the official Arch Linux repositories ASAP! Even Debian has ZFS in its repositories!). And you don't need to worry about any "write hole" issues as with Btrfs.
However, not only do you need to carefully study the documentation of each piece of technology you put together with mdadm, understand how the pieces fit together, and know how best to deal with potential problems, but you're also still limited by the "regular" filesystem you put on top of it all, and you don't get any of the really well designed and superior protection or management that ZFS or Btrfs provides.
Let me quote Allan Jude and Michael W. Lucas from their book "FreeBSD Mastery: ZFS":
ZFS is merely a filesystem, yes. But it includes features that many filesystems can't even contemplate.
ZFS is a copy-on-write filesystem that is extremely well designed, and it is light years ahead of Btrfs. ZFS is also very easy to use. Yes, you are allowed to shoot yourself in the foot with ZFS, this is *NIX after all, and if you don't plan ahead you can end up with a big mess, but then it is mostly your own fault. ZFS is very well documented, yet with ZFS you almost know by intuition how a command needs to be constructed.
Another big advantage of ZFS is that it is extremely reliable and very well battle tested in the industry. You'll find thousands of companies and regular people that have been using ZFS for a very long time, which means it is much easier to get help if you need it.
Btrfs is also a copy-on-write filesystem, and I believe it was supposed to be the "better ZFS" on Linux, but even though companies like Facebook have deployed Btrfs on millions of servers, they only care about the specific functionality they need, which means that the RAID5/6 "write hole" issue still isn't fixed even after so many years!
Another concern with Btrfs is that many people have reported serious data-loss issues with it, and the Debian Linux Wiki contains many relevant and unresolved issues (as of writing) to be alarmed about.
Some such reports (not the ones on the Debian wiki) are the result of a mismanaged situation. Very often people deploy solutions without studying and without testing. Then when errors happen they manage to blow everything up (by the trial-and-error approach), make things unrecoverable, and then blame the technology.
But many people - as in this example - have also reported only positive results with Btrfs. OpenSUSE, a Linux distribution sponsored by SUSE Linux GmbH and several other companies, runs with Btrfs as the default filesystem for the root partition. And Synology, a company founded in 2000 that creates network-attached storage (NAS), IP surveillance solutions, and network equipment, bases its NAS storage solutions on the Btrfs filesystem.
Also, even though ZFS is as mature as it is, it is still undergoing rapid and active development, with many new features continuously added. It has received a huge amount of work since the cooperation between all the different projects in OpenZFS, and especially from the ZFS on Linux project. So much so that the FreeBSD developers decided to re-base their ZFS filesystem code on the "ZFS on Linux" port rather than on the Illumos code from which they had originally been acquiring it. This also means that while ZFS is getting new features added, it is also experiencing new bugs and issues.
Personally I really love ZFS and it is without a doubt my favorite filesystem! It's an absolutely amazing feat of engineering. But I also very much hope that Btrfs eventually catches up, as it has improved a lot lately, and I don't think it deserves all the bad press it has gotten. I have deployed Btrfs in multiple setups without any issues and it has performed really well too.
Update 2020-01-23: Since writing this article I have abandoned Btrfs completely. I have found no situation where it was viable to run Btrfs rather than ZFS, whether on FreeBSD or even on GNU/Linux.
Anyway, I hope you have found this article worth the read.
Relevant reading
- ZFS on Wikipedia
- ORACLE ZFS documentation
- FreeBSD Handbook ZFS Chapter
- FreeBSD Mastery: ZFS
- FreeBSD Mastery: Advanced ZFS
- OpenZFS
- ZFS on Linux
- ZFS on Arch Linux Wiki
- ZFS on Gentoo Linux Wiki
- Btrfs on Wikipedia
- Btrfs on Debian Linux Wiki
- Btrfs on Arch Linux Wiki
- Btrfs on Gentoo Linux Wiki
- Btrfs on OpenSUSE Wiki
- Btrfs FAQ
- Linux RAID
- mdadm on Arch Linux Wiki
- dm-integrity