Battle testing ZFS, Btrfs and mdadm+dm-integrity
Published on 2019-05-05. Modified on 2020-01-23.
In this article I share the results of a home-lab experiment in which I threw some different problems at ZFS, Btrfs and mdadm+dm-integrity in a RAID-5 setup.
Table of Contents
- Introduction
- Myths and misunderstandings
- Some advice
- ZFS RAID-Z
- Btrfs RAID-5
- mdadm+dm-integrity RAID-5
- Final notes
- Relevant reading
Introduction
Let me start by saying that this is a simple write-up; it wasn't originally intended to be anything but personal notes, which I then decided to share.
I did my tests on and off during the course of about a week, but I have tried to be consistent. I have repeated many of the tests more than once, but did not document everything as the results were very similar.
My main interest was to see how the different systems would handle multiple breakdown situations in a RAID-5 setup. I also tested mirror (RAID-1) setups, but due to the length of the article I later decided not to include those.
I have used the word "pool" from the world of ZFS and Btrfs whenever I am dealing with the RAID-5 array.
Please forgive any shortcomings, missing parts, and mistakes in this write-up. Also, English is not my native language.
Now, on to the subject.
Whenever we use any kind of precautionary measures against data corruption, such as backup and/or filesystem data integrity verification, we need to test our setup with at least some simulated failures before we implement a solution. If we never "battle test" our solution, we have no real idea how it's going to handle a breakdown.
We need to ask questions like:
- If my system breaks down right now do I have adequate measures in place or will I lose important data?
- If I do have backup in place, will my backup suffice? Is it recent enough? Is it secure enough?
- What if my backup solution breaks during restoration? Do I need multiple backup solutions?
- What about bit rot?
- Do I need running data integrity verification?
- Do I need to backup everything or can I perhaps split data into important and non-important categories?
- Do I need to automate some of the procedures?
- Have I tested my solution?
- Have I tested my solution?
- Have I tested my solution?
Yes, we really need to test our solutions thoroughly :)
In any case, ZFS and Btrfs are both amazing open source filesystems with built-in data integrity verification.
I have more experience using ZFS, and the last time I tested Btrfs it was not performing well. File transfers were slow, and a situation occurred where I lost some files. However, that was a very long time ago. I have since looked through the Btrfs source code and commit logs, and Btrfs has received many fixes and improvements - especially during the last couple of years.
I therefore decided to put up a simple home-test environment on bare metal, throw some simulated problems at both ZFS and Btrfs, and then try to deal with them in an as-identical-as-possible manner on both systems. Later I added mdadm+dm-integrity.
I managed to get along with all the systems using only their respective man pages, even though I think the Btrfs documentation could benefit a lot from some examples.
I used old and cheap hardware suitable for a home-lab.
The computer I used has only 4 SATA-II connectors and I decided to use one for the boot device itself. I then used the rest for a RAID-5 (RAID-Z in ZFS) with just three hard drives. I could have booted from a USB stick and then used four drives, but I wanted to speed up both the installation time and boot time.
A RAID-5 requires 3 or more physical drives. RAID-5 stores parity blocks distributed across each disk. In the event of a failed disk, these parity blocks are used to reconstruct the data on a replacement disk. RAID-5 can withstand the loss of one disk.
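The parity arithmetic behind this can be sketched in a few lines. This is just the XOR idea, not real RAID code, and the block contents are made up for illustration:

```python
# Toy RAID-5 stripe: two data blocks plus one XOR parity block.
# Real RAID-5 rotates the parity block across the disks; this sketch
# only shows why one lost block can be reconstructed.

def xor_blocks(a: bytes, b: bytes) -> bytes:
    """XOR two equally sized blocks byte by byte."""
    return bytes(x ^ y for x, y in zip(a, b))

data1 = b"hello world, blk"          # block on disk 1
data2 = b"second data blok"          # block on disk 2
parity = xor_blocks(data1, data2)    # parity block on disk 3

# Disk 2 dies: its block is recovered from the survivors,
# because data1 XOR (data1 XOR data2) == data2.
rebuilt = xor_blocks(data1, parity)
assert rebuilt == data2
```

The same arithmetic works for any one missing block in the stripe, which is exactly why RAID-5 survives one disk but not two.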
I know some people "frown" upon RAID-5, but a RAID-5 is a really great way to utilize both speed and space. In any case no kind of RAID setup is a replacement for proper backup. If your data is important to you, you should always back it up.
In the picture below I have set up two identical computers. During the main testing I always used the same machine and hardware, but for extensive and repeated testing I put the second machine to work too.
Both computers are equipped with an Intel Core 2 Duo E8400 3.00 GHz CPU with 8 GB of memory and an Intel Pro 1000 PT PCIe x1 Gigabit NIC. All the hard drives are some really amazing but old 1 TB Seagate Barracuda ES.2 drives from 2010 (with the latest firmware) that have all been through quite a lot of "beating" over the years. I believe I have about ten of these drives and (if I remember correctly) only one has failed, about a year ago. The rest are still going strong.
For ZFS I ran Debian Linux "Stretch" with kernel version 4.19.28 and zfs-dkms version 0.7.12-1 from backports. For Btrfs I ran Arch Linux with kernel version 5.0.9 and Btrfs version 4.20.2. For both ZFS and Btrfs I used Samba. In my experience Samba performs better than NFS even between Linux-only machines and even though NFS uses less resources, I prefer Samba for various reasons.
At some point during my testing with Btrfs I discovered dm-integrity. I therefore decided to set up a RAID-5 with mdadm+dm-integrity on the Arch Linux installation and repeat the tests.
During the process I sometimes jumped back and forth between different tests on the different systems. For example, I first tested ZFS, then repeated the tests with Btrfs, then began testing mdadm+dm-integrity, then went back and performed some more tests on both ZFS and Btrfs, etc. The article is therefore put together from the various tests, and the date and time in the different terminal outputs don't always match up. I also sometimes changed the disks in my setup, so the disk IDs occasionally change. Please ignore that.
Myths and misunderstandings
One thing that really bothers me is how much false information exists on the Internet regarding both ZFS and Btrfs.
Some misinformation has been spread due to inexperience, wrong expectations, and/or misunderstandings about the usage of these systems.
Let's get some of the myths and misunderstandings out of the way:
- Myth: ZFS requires tons of memory!
This is one of the biggest misunderstandings about ZFS. The only situation in which ZFS requires lots of memory is if you specifically use de-duplication. I have run ZFS successfully using FreeBSD 12 on a Raspberry Pi 3 with two 1 TB USB disks attached to a single USB 3 hub. ZFS never used more than half the available memory during any kind of procedure, and you can change the settings so it runs with even much less than that.
- Myth: Red Hat has removed Btrfs because they consider it useless!
No, that is not why Red Hat removed Btrfs. A former Red Hat developer explains the situation on Hacker News.
- Myth: ZFS and Btrfs require ECC memory!
ZFS or Btrfs without ECC memory is no worse than any other filesystem without ECC memory. Using ECC memory is recommended in situations where the strongest data integrity guarantees are required. Random bit flips caused by cosmic rays or by faulty memory can go undetected without ECC memory, and any filesystem will then write the damaged data from memory to disk and be unable to automatically detect the corruption. Also note that ECC memory is often not supported by consumer grade hardware, and it is more expensive. In any case, you can run ZFS and Btrfs without ECC memory; it's not a requirement.
- Myth: Restoring a RAID-5 puts more stress on the drives!
Drives are not stressed! It's their job to read and write data! You are using your drives, not stressing them. It takes longer to restore a RAID-5 because the parity data needs to be calculated by the CPU, which is slower than simply copying data between disks in a mirror (RAID-1), but there is no stress involved.
- Myth: Using USB disk devices with ZFS or Btrfs is okay!
Sometimes you can get away with it without any problems whatsoever, but many USB controllers and USB storage devices are really bad. If things break, you cannot blame the filesystem. On Btrfs, a Parent transid verify failed error is often the result of a failed internal consistency check of the filesystem's metadata due to a bad USB storage device. Other issues such as automatic and sudden un-mounting, wrong file sizes, data corruption, sudden shutdowns, and several other problems are often caused by a bad USB storage device and/or USB power issues.
- Myth: Btrfs still has the write hole issue and is completely useless!
The myth part of this is that Btrfs is completely useless, not the write hole issue itself. As of writing, Btrfs still has some issues, but it is definitely not useless, and you can even run RAID5/6 if you take some specific precautions. Check the RAID5/6 information. The "write hole" problem with Btrfs only potentially exists if you experience a power loss (an unclean shutdown) and then have a disk fail immediately thereafter (or possibly at the same time) - without running a scrub in between. These two distinct failures combined break the Btrfs RAID-5 redundancy. However, I was not able to reproduce the problem in any of my many tests with Btrfs. Update 2020-01-23: People have been emailing me with examples of the write hole problem persisting, where they have lost data, even with the Btrfs version in the 5.x kernel.
- Myth: Btrfs is abandoned!
Btrfs is used in production worldwide. Btrfs is deployed by Facebook on millions of servers with significant efficiency gains. It is also used by many other companies and projects, and Btrfs keeps getting better and better.
- Myth: mdadm+XYZ can replace ZFS or Btrfs!
No. They don't even compare.
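The bit-flip scenario from the ECC myth above is easy to demonstrate: a checksumming filesystem stores a checksum per block and compares it on every read. A minimal sketch, using SHA-256 merely as a stand-in for the filesystem's internal checksum:

```python
import hashlib

# A block as written to disk, together with its stored checksum.
block = bytearray(b"some file contents stored on disk")
stored_checksum = hashlib.sha256(block).digest()

# A single bit flips on disk (or in non-ECC memory before the write):
block[5] ^= 0x01

# On read, the checksum comparison exposes the silent corruption;
# a plain filesystem would happily return the damaged data instead.
corrupted = hashlib.sha256(block).digest() != stored_checksum
assert corrupted
```

Note the caveat from the myth above: if the flip happens in memory *before* the checksum is computed, even a checksumming filesystem will faithfully store the damaged data - which is the one case ECC memory protects against.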
Some advice
Most data loss reported on the mailing lists of ZFS, Btrfs, and mdadm, is down to user error while attempting to recover a failed array. Never use a trial-and-error approach when something goes wrong with your filesystem or backup solution!
Very often a really bad situation is caused by a trial-and-error approach to a problem. With Btrfs many people immediately use the btrfs check --repair
command when they experience an issue, but this is actually the very last command you want to run.
Understand what you can expect from the filesystem you're using, how it works, and how each system implements a specific functionality. Don't blame the filesystem when it doesn't fulfill mistaken expectations.
ZFS RAID-Z
Let's begin the testing with ZFS.
The three disks are listed "by-id" and I'll create the ZFS pool using those IDs, as they also contain the serial number, which makes it very easy to identify each drive.
$ ls -gG /dev/disk/by-id/ ata-ST31000340NS_9QJ089LF -> ../../sdd ata-ST31000340NS_9QJ0EQ1V -> ../../sdb ata-ST31000340NS_9QJ0F2YQ -> ../../sdc
With a RAID-Z (RAID-5) I can stand to lose one drive and the pool will still function, however I need to "resilver" the pool as soon as possible with a replacement drive.
Resilvering is the same concept as rebuilding a RAID array. With most other RAID implementations there is no distinction between which blocks are in use and which aren't. A typical rebuild therefore runs from the beginning of the disk to the end - this is how mdadm works, and it is extremely slow. But because ZFS knows the structure of the RAID system and the metadata, it rebuilds only the blocks in use. The ZFS developers therefore coined the term "resilvering" rather than "rebuilding".
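Back-of-the-envelope numbers show why this matters. The figures below are illustrative assumptions, not measurements, but they match the scale of the roughly 3 minute resilver seen later in this article:

```python
# Rough comparison: whole-disk rebuild (mdadm style) vs used-blocks-only
# resilver (ZFS style). All numbers are illustrative assumptions.

disk_size_gb = 1000   # a 1 TB member drive
used_gb = 36          # data actually stored on the pool
rate_mb_s = 200       # assumed sequential rebuild rate

full_rebuild_min = disk_size_gb * 1024 / rate_mb_s / 60   # entire disk
resilver_min = used_gb * 1024 / rate_mb_s / 60            # used blocks only

print(round(full_rebuild_min))  # ~85 minutes for the whole disk
print(round(resilver_min))      # ~3 minutes for just the used blocks
```

The gap only grows with bigger, mostly empty drives; with a nearly full pool the two approaches converge.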
I'm going to create a pool using the -f
option because ZFS will detect that the attached drives used to belong to an old pool and will not allow for it to be used in a new pool unless forced to do so (I have used the drives in a previous setup).
# zpool create -f -O xattr=sa -O dnodesize=auto -O atime=off -o ashift=12 pool1 raidz ata-ST31000340NS_9QJ0F2YQ ata-ST31000340NS_9QJ0EQ1V ata-ST31000340NS_9QJ089LF
I'm then going to create a ZFS dataset on the pool with lz4 compression enabled.
# zfs create -o compress=lz4 pool1/pub # zfs list NAME USED AVAIL REFER MOUNTPOINT pool1 575K 1.75T 128K /pool1 pool1/pub 128K 1.75T 128K /pool1/pub
I have then exported the "pub" directory using Samba and will begin by copying some files over from a client computer using rsync
.
$ rsync -a --progress --stats tmp/ mnt/testbox/pub/tmp/ 1.pdf 18,576,345 100% 196.49MB/s 0:00:00 (xfr#1, to-chk=6/8) 2.pdf 30,255,102 100% 70.89MB/s 0:00:00 (xfr#2, to-chk=5/8) 3.pdf 22,016,195 100% 23.28MB/s 0:00:00 (xfr#3, to-chk=4/8) bar.mkv 35,456,180,485 100% 112.92MB/s 0:04:59 (xfr#4, to-chk=3/8) boo.iso 625,338,368 100% 21.64MB/s 0:00:27 (xfr#5, to-chk=2/8) foo.mkv 1,548,841,922 100% 135.76MB/s 0:00:10 (xfr#6, to-chk=1/8) moo.iso 415,633,408 100% 25.86MB/s 0:00:15 (xfr#7, to-chk=0/8) Number of files: 8 (reg: 7, dir: 1) Number of created files: 8 (reg: 7, dir: 1) Number of deleted files: 0 Number of regular files transferred: 7 Total file size: 38,116,841,825 bytes Total transferred file size: 38,116,841,825 bytes Literal data: 38,116,841,825 bytes Matched data: 0 bytes File list size: 0 File list generation time: 0.001 seconds File list transfer time: 0.000 seconds Total bytes sent: 38,126,148,150 Total bytes received: 202 sent 38,126,148,150 bytes received 202 bytes 106,945,717.68 bytes/sec total size is 38,116,841,825 speedup is 1.0
Now the ZFS pool has some data:
# zfs list NAME USED AVAIL REFER MOUNTPOINT pool1 35.5G 1.72T 128K /pool1 pool1/pub 35.5G 1.72T 35.5G /pool1/pub
ZFS - Power outage
I'll then add yet another file using rsync
and then pull the power cord on the ZFS machine halfway through the transfer.
I have then aborted the rest of the file transfer on the client and turned the ZFS machine back on.
$ rsync -a --progress --stats tmp/ mnt/testbox/pub/tmp/ sending incremental file list zoo.mkv 5,918,261,248 54% 64.88kB/s 21:11:16 ^C
Because ZFS uses transactional semantics, the partially transferred file is lost, but nothing has happened to the files already on the system, there is no damage to the filesystem, and no filesystem check needs to be run.
Let's take a look at the ZFS documentation from Oracle regarding the Transactional Semantics:
ZFS is a transactional file system, which means that the file system state is always consistent on disk. Traditional file systems overwrite data in place, which means that if the system loses power, for example, between the time a data block is allocated and when it is linked into a directory, the file system will be left in an inconsistent state. Historically, this problem was solved through the use of the fsck command. This command was responsible for reviewing and verifying the file system state, and attempting to repair any inconsistencies during the process. This problem of inconsistent file systems caused great pain to administrators, and the fsck command was never guaranteed to fix all possible problems. More recently, file systems have introduced the concept of journaling. The journaling process records actions in a separate journal, which can then be replayed safely if a system crash occurs. This process introduces unnecessary overhead because the data needs to be written twice, often resulting in a new set of problems, such as when the journal cannot be replayed properly.
With a transactional file system, data is managed using copy on write semantics. Data is never overwritten, and any sequence of operations is either entirely committed or entirely ignored. Thus, the file system can never be corrupted through accidental loss of power or a system crash. Although the most recently written pieces of data might be lost, the file system itself will always be consistent. In addition, synchronous data (written using the O_DSYNC flag) is always guaranteed to be written before returning, so it is never lost.
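The copy-on-write idea from the quote can be sketched as follows. This is a toy model (a dict as the "disk" and a root pointer as the commit point), not how ZFS actually lays out data:

```python
# Toy copy-on-write store: data is never overwritten in place, and the
# commit is a single atomic switch of the root pointer.

store = {0: b"old contents"}   # block id -> block data ("the disk")
root = 0                       # points at the current consistent version

def cow_write(new_data: bytes) -> int:
    """Write new data to a fresh block; the old block stays untouched."""
    new_id = max(store) + 1
    store[new_id] = new_data
    return new_id

new_id = cow_write(b"new contents")
# A crash at this point is harmless: root still references the old,
# fully consistent version, so no fsck-style repair is needed.
root = new_id                  # the atomic commit
assert store[root] == b"new contents" and store[0] == b"old contents"
```

This is exactly the property exercised by the power-cord test above: the half-written file simply never got committed, so the pool stayed consistent.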
This is confirmed by a look at the status of the pool:
# zpool status pool: pool1 state: ONLINE scan: none requested config: NAME STATE READ WRITE CKSUM pool1 ONLINE 0 0 0 raidz1-0 ONLINE 0 0 0 ata-ST31000340NS_9QJ0F2YQ ONLINE 0 0 0 ata-ST31000340NS_9QJ0EQ1V ONLINE 0 0 0 ata-ST31000340NS_9QJ089LF ONLINE 0 0 0 errors: No known data errors # zfs list NAME USED AVAIL REFER MOUNTPOINT pool1 35.5G 1.72T 128K /pool1 pool1/pub 35.5G 1.72T 35.5G /pool1/pub
And from the client's point of view:
$ ls -gG mnt/testbox/pub/tmp total 37194300 -rwxr-xr-x 1 18576345 Apr 21 09:08 1.pdf -rwxr-xr-x 1 30255102 Apr 21 09:08 2.pdf -rwxr-xr-x 1 22016195 Apr 21 09:08 3.pdf -rwxr-xr-x 1 35456180485 Apr 21 07:58 bar.mkv -rwxr-xr-x 1 625338368 Mar 5 2018 boo.iso -rwxr-xr-x 1 1548841922 Apr 15 23:50 foo.mkv -rwxr-xr-x 1 415633408 Mar 5 2018 moo.iso
ZFS - Drive failure
Now I want to simulate a simple drive failure. I'm going to remove one of the drives from the ZFS machine, then replace it with another drive, and then resilver the ZFS pool.
I have removed the drive:
# zpool status pool: pool1 state: DEGRADED status: One or more devices could not be used because the label is missing or invalid. Sufficient replicas exist for the pool to continue functioning in a degraded state. action: Replace the device using 'zpool replace'. see: http://zfsonlinux.org/msg/ZFS-8000-4J scan: none requested config: NAME STATE READ WRITE CKSUM pool1 DEGRADED 0 0 0 raidz1-0 DEGRADED 0 0 0 ata-ST31000340NS_9QJ0F2YQ ONLINE 0 0 0 1803500998269517419 UNAVAIL 0 0 0 was /dev/disk/by-id/ata-ST31000340NS_9QJ0EQ1V-part1 ata-ST31000340NS_9QJ089LF ONLINE 0 0 0 errors: No known data errors # zfs list NAME USED AVAIL REFER MOUNTPOINT pool1 35.5G 1.72T 128K /pool1 pool1/pub 35.5G 1.72T 35.5G /pool1/pub
Even though the pool is in a degraded state, I can still mount the pool on the client and use the files.
$ mount mnt/testbox/pub $ ls -gG mnt/testbox/pub/tmp total 37194300 -rwxr-xr-x 1 18576345 Apr 21 09:08 1.pdf -rwxr-xr-x 1 30255102 Apr 21 09:08 2.pdf -rwxr-xr-x 1 22016195 Apr 21 09:08 3.pdf -rwxr-xr-x 1 35456180485 Apr 21 07:58 bar.mkv -rwxr-xr-x 1 625338368 Mar 5 2018 boo.iso -rwxr-xr-x 1 1548841922 Apr 15 23:50 foo.mkv -rwxr-xr-x 1 415633408 Mar 5 2018 moo.iso
I can also write to the pool.
$ echo Hello > mnt/testbox/pub/tmp/hello.txt $ ls -gG mnt/testbox/pub/tmp/ total 37194304 -rwxr-xr-x 1 18576345 Apr 21 09:08 1.pdf -rwxr-xr-x 1 30255102 Apr 21 09:08 2.pdf -rwxr-xr-x 1 22016195 Apr 21 09:08 3.pdf -rwxr-xr-x 1 35456180485 Apr 21 07:58 bar.mkv -rwxr-xr-x 1 625338368 Mar 5 2018 boo.iso -rwxr-xr-x 1 1548841922 Apr 15 23:50 foo.mkv -rwxr-xr-x 1 6 Apr 24 23:11 hello.txt -rwxr-xr-x 1 415633408 Mar 5 2018 moo.iso
Now I need to identify the new drive:
$ ls -l /dev/disk/by-id/ ata-ST31000340NS_9QJ0DVN2 -> ../../sdb
Then I need to replace the old drive with the new. The procedure, since the old drive is completely gone, is not to detach and then replace, but simply to replace with zpool replace pool old_device new_device
.
# zpool replace pool1 ata-ST31000340NS_9QJ0EQ1V ata-ST31000340NS_9QJ0DVN2
ZFS will immediately and automatically begin the resilvering of the pool:
# zpool status pool: pool1 state: DEGRADED status: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state. action: Wait for the resilver to complete. scan: resilver in progress since Wed Apr 24 23:19:33 2019 10.5G scanned out of 53.3G at 228M/s, 0h3m to go 3.49G resilvered, 19.68% done config: NAME STATE READ WRITE CKSUM pool1 DEGRADED 0 0 0 raidz1-0 DEGRADED 0 0 0 ata-ST31000340NS_9QJ0F2YQ ONLINE 0 0 0 replacing-1 DEGRADED 0 0 0 1803500998269517419 UNAVAIL 0 0 0 was /dev/disk/by-id/ata-ST31000340NS_9QJ0EQ1V-part1 ata-ST31000340NS_9QJ0DVN2 ONLINE 0 0 0 (resilvering) ata-ST31000340NS_9QJ089LF ONLINE 0 0 0 errors: No known data errors
After about 3 minutes the pool is back up and ready for usage:
# zpool status pool: pool1 state: ONLINE scan: resilvered 17.8G in 0h3m with 0 errors on Wed Apr 24 23:22:56 2019 config: NAME STATE READ WRITE CKSUM pool1 ONLINE 0 0 0 raidz1-0 ONLINE 0 0 0 ata-ST31000340NS_9QJ0F2YQ ONLINE 0 0 0 ata-ST31000340NS_9QJ0DVN2 ONLINE 0 0 0 ata-ST31000340NS_9QJ089LF ONLINE 0 0 0 errors: No known data errors
And from the client:
$ ls -gG mnt/testbox/pub/tmp total 37194304 -rwxr-xr-x 1 18576345 Apr 21 09:08 1.pdf -rwxr-xr-x 1 30255102 Apr 21 09:08 2.pdf -rwxr-xr-x 1 22016195 Apr 21 09:08 3.pdf -rwxr-xr-x 1 35456180485 Apr 21 07:58 bar.mkv -rwxr-xr-x 1 625338368 Mar 5 2018 boo.iso -rwxr-xr-x 1 1548841922 Apr 15 23:50 foo.mkv -rwxr-xr-x 1 6 Apr 24 23:11 hello.txt -rwxr-xr-x 1 415633408 Mar 5 2018 moo.iso
Just to make sure all data has been resilvered without any errors during writing I'll perform a scrub and validate that everything is alright:
# zpool scrub pool1
And about 3 minutes later the scrub has finished:
# zpool status pool: pool1 state: ONLINE scan: scrub repaired 0B in 0h3m with 0 errors on Thu Apr 24 23:56:01 2019 config: NAME STATE READ WRITE CKSUM pool1 ONLINE 0 0 0 raidz1-0 ONLINE 0 0 0 ata-ST31000340NS_9QJ0F2YQ ONLINE 0 0 0 ata-ST31000340NS_9QJ0DVN2 ONLINE 0 0 0 ata-ST31000340NS_9QJ089LF ONLINE 0 0 0 errors: No known data errors
Since ZFS has only restored the used data blocks, not the entire disk, the procedure was very fast, as was the scrubbing.
ZFS - Drive failure during file transfer
Now I want to remove a disk in the middle of an active file transfer in order to simulate a total, but not permanent, failure of a disk. This might happen if the disk power cord has managed to wiggle itself loose, or if the disk sits in a slot and hasn't been pushed all the way in, etc.
$ rsync -a --progress --stats tmp/ mnt/testbox/pub/tmp/ sending incremental file list zoo.mkv 10,867,033,488 100% 127.95MB/s 0:01:20 (xfr#1, to-chk=0/9) Number of files: 9 (reg: 8, dir: 1) Number of created files: 1 (reg: 1) Number of deleted files: 0 Number of regular files transferred: 1 Total file size: 48,983,875,313 bytes Total transferred file size: 10,867,033,488 bytes Literal data: 10,867,033,488 bytes Matched data: 0 bytes File list size: 0 File list generation time: 0.001 seconds File list transfer time: 0.000 seconds Total bytes sent: 10,869,686,827 Total bytes received: 38 sent 10,869,686,827 bytes received 38 bytes 102,062,787.46 bytes/sec total size is 48,983,875,313 speedup is 4.5
I removed the drive by disconnecting its individual power cord. The ZFS machine reacted by halting the file transfer for about a second, then resumed at full speed, and the client only experienced a momentary drop in the transfer speed.
The file transfer was then completed without any problems on the client side.
On the ZFS machine the pool has now changed the state to DEGRADED:
# zpool status pool: pool1 state: DEGRADED status: One or more devices could not be used because the label is missing or invalid. Sufficient replicas exist for the pool to continue functioning in a degraded state. action: Replace the device using 'zpool replace'. see: http://zfsonlinux.org/msg/ZFS-8000-4J scan: none requested config: NAME STATE READ WRITE CKSUM pool1 DEGRADED 0 0 0 raidz1-0 DEGRADED 0 0 0 ata-ST31000340NS_9QJ0ES1V ONLINE 0 0 0 ata-ST31000340NS_9QJ0ET8D ONLINE 0 0 0 ata-ST31000340NS_9QJ0EZZC UNAVAIL 0 0 0 errors: No known data error
I powered down the machine in order to safely reattach the drive and then rebooted.
ZFS has detected the error:
# zpool status pool: pool1 state: ONLINE status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://zfsonlinux.org/msg/ZFS-8000-9P scan: none requested config: NAME STATE READ WRITE CKSUM pool1 ONLINE 0 0 0 raidz1-0 ONLINE 0 0 0 ata-ST31000340NS_9QJ0ES1V ONLINE 0 0 0 ata-ST31000340NS_9QJ0ET8D ONLINE 0 0 0 ata-ST31000340NS_9QJ0EZZC ONLINE 0 0 11 errors: No known data errors
This situation simulates a physical drive failure while the ZFS pool is under active use, and it is probably one of the most common situations in real life.
In order to handle the problem correctly I would normally need to investigate the situation.
- Has the drive physically failed and therefore needs a replacement?
- Or is it perhaps a wire that has managed to wiggle itself loose?
- Or is it perhaps the wire itself that is broken?
- Or has the disk connector (both on the disk itself and on the motherboard) experienced any physical corrosion? (This actually happens).
It's important to remember that whether a disk is good or bad is not a simple yes or no question. A disk can be "mostly" good, with a few physical sectors that give errors. A disk can be bad for a few seconds, hours, or days, and then go back to working fine again for years.
Due to firmware issues, a disk may handle most operations fine while certain operations don't work well. Disk problems come in shades of gray; they are multi-dimensional and time-dependent!
Now, since this is just a simulation I know what to do, but in a real life situation you need to investigate the above questions as any of the above issues might be the cause of the problem.
If there aren't any physical problems with the setup, you might be able to get some useful information from S.M.A.R.T.
In my situation I have determined that the problem was caused by a system administrator who managed to pull the power cord from the disk "by mistake" so I don't need to replace the drive :)
The correct approach is therefore to do a scrub after the drive has been reattached. During a scrub ZFS will detect any checksum errors and will restore the data using the parity data.
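What a scrub does can be sketched by combining the two ingredients: per-block checksums to find bad blocks, and parity to rewrite them. A toy model, not ZFS internals (SHA-256 stands in for ZFS's checksum):

```python
import hashlib

def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# One stripe: two data blocks, their XOR parity, and stored checksums.
d1, d2 = b"data block one!!", b"data block two!!"
parity = xor_blocks(d1, d2)
sums = {"d1": hashlib.sha256(d1).digest(), "d2": hashlib.sha256(d2).digest()}

# Something corrupts d1 on disk while the drive was disconnected:
d1 = b"data block #@&%!"

# Scrub: verify every block against its checksum and rebuild any
# mismatching block from the remaining block plus parity.
if hashlib.sha256(d1).digest() != sums["d1"]:
    d1 = xor_blocks(d2, parity)

assert hashlib.sha256(d1).digest() == sums["d1"]   # repaired
```

The checksum is what distinguishes this from a plain RAID rebuild: the scrub knows *which* copy is wrong, instead of just recomputing parity over possibly bad data.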
# zpool scrub pool1 # zpool status pool: pool1 state: DEGRADED status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://zfsonlinux.org/msg/ZFS-8000-9P scan: scrub repaired 4.19G in 0h6m with 0 errors on Fri Apr 26 01:32:23 2019 config: NAME STATE READ WRITE CKSUM pool1 DEGRADED 0 0 0 raidz1-0 DEGRADED 0 0 0 ata-ST31000340NS_9QJ0ES1V ONLINE 0 0 0 ata-ST31000340NS_9QJ0ET8D ONLINE 0 0 0 ata-ST31000340NS_9QJ0EZZC DEGRADED 0 0 67.2K too many errors errors: No known data error
After the scrubbing is done ZFS tells us that it has repaired 4.19GB of data with 0 errors.
Even though ZFS has managed to repair everything without any errors, it still keeps the pool in a degraded state because it is up to the system administrator to decide what needs to be done. This is important because even though ZFS has managed to rescue all the data, we might still be dealing with an unhealthy device.
Had there been any unrecoverable errors during the scrubbing we would be facing a disk that is too damaged for ZFS to continue working with it.
Can we clear the log and bring the pool back into the ONLINE and healthy state? Or do we need to replace the drive anyway? Perhaps S.M.A.R.T has warned us that the drive is currently working but experiencing occasional issues and soon needs to be fully replaced.
In this case we know that the drive is working fine so I'll just clear the log:
# zpool clear pool1 # zpool status pool: pool1 state: ONLINE scan: scrub repaired 0B in 0h3m with 0 errors on Fri Apr 26 02:09:44 2019 config: NAME STATE READ WRITE CKSUM pool1 ONLINE 0 0 0 raidz1-0 ONLINE 0 0 0 ata-ST31000340NS_9QJ0ES1V ONLINE 0 0 0 ata-ST31000340NS_9QJ0ET8D ONLINE 0 0 0 ata-ST31000340NS_9QJ0EZZC ONLINE 0 0 0 errors: No known data error
As a side note I can mention that I have worked with tons of hardware over the past 25+ years, and I have seen several situations in which S.M.A.R.T has reported problems with drives that then kept going for many years after being reported as both old and worn out. Of course you cannot ignore such reports; depending on the situation you might need to replace the drive, but it can often still be used in a less important capacity.
ZFS - Data corruption during file transfer
Now I want to simulate data corruption in the middle of a file transfer from the client. Not a drive failure, but some corruption of the data located on the pool.
I have removed the "zoo.mkv" file and while the rsync
command is running again I'll do a couple of dd
commands on the ZFS machine on one of the drives.
# dd if=/dev/urandom of=/dev/disk/by-id/ata-ST31000340NS_9QJ0ES1V seek=100000 count=1000 bs=1k
While the transfer is still running, I'm checking the pool status:
# zpool status pool: pool1 state: ONLINE scan: none requested config: NAME STATE READ WRITE CKSUM pool1 ONLINE 0 0 0 raidz1-0 ONLINE 0 0 1 ata-ST31000340NS_9QJ0ES1V ONLINE 0 0 0 ata-ST31000340NS_9QJ0ET8D ONLINE 0 0 0 ata-ST31000340NS_9QJ0EZZC ONLINE 0 0 0 errors: No known data errors
ZFS shows a checksum issue which has been fixed. dmesg and the logs currently don't provide any further information, but we can take a look at the zpool events -v
command if we want further information:
# zpool events -v Apr 26 2019 23:05:59.990726744 ereport.fs.zfs.checksum class = "ereport.fs.zfs.checksum" ena = 0x18549e4f2ec00401 detector = (embedded nvlist) version = 0x0 scheme = "zfs" pool = 0x4cdea36f1d7afa7c vdev = 0x772f5157f66ae182 (end detector) pool = "pool1" pool_guid = 0x4cdea36f1d7afa7c pool_state = 0x0 pool_context = 0x0 pool_failmode = "wait" vdev_guid = 0x772f5157f66ae182 vdev_type = "disk" vdev_path = "/dev/disk/by-id/ata-ST31000340NS_9QJ0ES1V-part1" vdev_ashift = 0xc vdev_complete_ts = 0x18549e3862a vdev_delta_ts = 0x19c5648 vdev_read_errors = 0x0 vdev_write_errors = 0x0 vdev_cksum_errors = 0x0 parent_guid = 0x3a01d1f81d93aaf8 parent_type = "raidz" vdev_spare_paths = vdev_spare_guids = zio_err = 0x34 zio_flags = 0x100080 zio_stage = 0x100000 zio_pipeline = 0xf80000 zio_delay = 0x0 zio_timestamp = 0x0 zio_delta = 0x0 zio_offset = 0x5cb6000 zio_size = 0x6000 zio_objset = 0x48 zio_object = 0x82 zio_level = 0x1 zio_blkid = 0x0 bad_ranges = 0x0 0x6000 bad_ranges_min_gap = 0x8 bad_range_sets = 0xcaa5 bad_range_clears = 0xb597 bad_set_histogram = 0x32c 0x32b 0x334 0x312 0x334 0x31c 0x306 0x31f 0x300 0x340 0x303 0x30d 0x330 0x318 0x324 0x2f0 0x304 0x32b 0x314 0x33c 0x339 0x2fd 0x33c 0x347 0x33c 0x379 0x33f 0x324 0x327 0x351 0x310 0x313 0x31f 0x31c 0x31e 0x334 0x354 0x32e 0x33e 0x312 0x32d 0x369 0x340 0x337 0x32a 0x330 0x32c 0x33a 0x319 0x328 0x30a 0x332 0x32a 0x320 0x333 0x333 0x34b 0x316 0x347 0x30c 0x34c 0x35a 0x34a 0x2ff bad_cleared_histogram = 0x2cc 0x2c3 0x2e2 0x2ca 0x29c 0x2fa 0x2f8 0x2d0 0x2e6 0x2cd 0x2d5 0x2c3 0x2bf 0x2d7 0x2d7 0x2fa 0x2c8 0x2d4 0x2d1 0x303 0x2ef 0x2fa 0x2f4 0x2c1 0x2a3 0x2b7 0x2b4 0x2e9 0x2e6 0x2c9 0x2d9 0x2eb 0x2c1 0x2b9 0x2e4 0x2d7 0x2c0 0x2ff 0x2c7 0x2dc 0x2e8 0x2bc 0x2c7 0x2d8 0x2ed 0x2db 0x2db 0x318 0x2e8 0x2c8 0x2db 0x2da 0x2de 0x2f7 0x2d0 0x2e6 0x2ae 0x2fb 0x2ca 0x2d5 0x2a9 0x2d2 0x2e2 0x2aa time = 0x5cc372b7 0x3b0d4a58 eid = 0x1f
The ZFS event output has never been fully documented publicly, but we do know from the above output that some bad bit ranges were detected and cleared, and that everything is back in order.
ZFS - The dd mistake
Have you ever made the mistake of running the rm -rf
command as the root user on the / path of your disk? Or even worse what about the dd
command?
I want to extend the above test and see what happens if I mistakenly type the dd
command and let it run for a while during a file transfer from the client.
I have deleted all the files, restarted rsync
, and I am now letting the dd
run:
# dd if=/dev/urandom of=/dev/disk/by-id/ata-ST31000340NS_9QJ0ES1V bs=1k ^C348001+0 records in 348001+0 records out 356353024 bytes (356 MB, 340 MiB) copied, 47.1212 s, 7.6 MB/s
This should make a big mess of things.
Nothing noticeable has happened on the client:
$ rsync -a --progress --stats tmp/ mnt/testbox/pub/tmp/ sending incremental file list ./ 1.pdf 18,576,345 100% 178.63MB/s 0:00:00 (xfr#1, to-chk=7/9) 2.pdf 30,255,102 100% 76.33MB/s 0:00:00 (xfr#2, to-chk=6/9) 3.pdf 22,016,195 100% 28.68MB/s 0:00:00 (xfr#3, to-chk=5/9) bar.mkv 14,681,931,776 41% 112.62MB/s 0:03:00
ZFS has detected the errors:
# zpool status pool: pool1 state: ONLINE status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://zfsonlinux.org/msg/ZFS-8000-9P scan: none requested config: NAME STATE READ WRITE CKSUM pool1 ONLINE 0 0 0 raidz1-0 ONLINE 0 0 0 ata-ST31000340NS_9QJ0ES1V ONLINE 0 0 1 ata-ST31000340NS_9QJ0ET8D ONLINE 0 0 0 ata-ST31000340NS_9QJ0EZZC ONLINE 0 0 0 errors: No known data error
This shows the remarkable resilience of ZFS. Even though I just started running dd
on one of the drives, the filesystem keeps working and clients can still read from and write to the pool.
All I need to do is to perform a scrub to fix the problems:
# zpool scrub pool1 # zpool status pool: pool1 state: ONLINE status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://zfsonlinux.org/msg/ZFS-8000-9P scan: scrub in progress since Tue Apr 30 01:35:56 2019 2.24G scanned out of 68.5G at 209M/s, 0h5m to go 28K repaired, 3.28% done config: NAME STATE READ WRITE CKSUM pool1 ONLINE 0 0 0 raidz1-0 ONLINE 0 0 0 ata-ST31000340NS_9QJ0ES1V ONLINE 0 0 9 (repairing) ata-ST31000340NS_9QJ0ET8D ONLINE 0 0 0 ata-ST31000340NS_9QJ0EZZC ONLINE 0 0 0 errors: No known data errors
And the result:
# zpool status pool: pool1 state: DEGRADED status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://zfsonlinux.org/msg/ZFS-8000-9P scan: scrub repaired 60K in 0h3m with 0 errors on Tue Apr 30 01:39:50 2019 config: NAME STATE READ WRITE CKSUM pool1 DEGRADED 0 0 0 raidz1-0 DEGRADED 0 0 0 ata-ST31000340NS_9QJ0ES1V DEGRADED 0 0 17 too many errors ata-ST31000340NS_9QJ0ET8D ONLINE 0 0 0 ata-ST31000340NS_9QJ0EZZC ONLINE 0 0 0 errors: No known data errors
Again we have to investigate in order to determine if the disk that has suffered checksum errors needs to be replaced or if we can simply clear the log.
ZFS has managed to repair everything with 0 errors and all the disks are back up and working fine, I'll clear the log:
# zpool clear pool1 # zpool status pool: pool1 state: ONLINE scan: scrub repaired 0B in 0h3m with 0 errors on Tue Apr 30 01:59:34 2019 config: NAME STATE READ WRITE CKSUM pool1 ONLINE 0 0 0 raidz1-0 ONLINE 0 0 0 ata-ST31000340NS_9QJ0ES1V ONLINE 0 0 0 ata-ST31000340NS_9QJ0ET8D ONLINE 0 0 0 ata-ST31000340NS_9QJ0EZZC ONLINE 0 0 0 errors: No known data errors
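Rather than eyeballing the CKSUM column every time, the status output can be checked mechanically. This is only a sketch of the idea, assuming the standard NAME/STATE/READ/WRITE/CKSUM column layout shown above; the script name is made up:

```shell
#!/bin/sh
# cksum-check.sh (hypothetical name): read `zpool status` output on
# stdin and print every device line with a non-zero CKSUM count.
# Assumes the NAME STATE READ WRITE CKSUM column layout shown above.
awk '$2 == "ONLINE" || $2 == "DEGRADED" {
    if ($5 != "" && $5 != "0")
        print $1 " has " $5 " checksum errors"
}'
```

It can then be used as `zpool status pool1 | sh cksum-check.sh`, for example from a cron job.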
ZFS - A second drive failure during a replacement
The most dreaded situation in any RAID-5 setup is a second drive failing during the restoration of the pool.
Let's see what's going to happen.
I have created a new pool with three disks and have transferred all the files from the client to the pool.
On the client:
# ls -gG /pool1/pub/tmp/ total 47803477 -rwxrw-r-- 1 18576345 Apr 21 09:08 1.pdf -rwxrw-r-- 1 30255102 Apr 21 09:08 2.pdf -rwxrw-r-- 1 22016195 Apr 21 09:08 3.pdf -rwxrw-r-- 1 35456180485 Apr 21 07:58 bar.mkv -rwxrw-r-- 1 625338368 Mar 5 2018 boo.iso -rwxrw-r-- 1 1548841922 Apr 15 23:50 foo.mkv -rwxrw-r-- 1 415633408 Mar 5 2018 moo.iso -rwxrw-r-- 1 10867033488 Apr 22 21:10 zoo.mkv
On the ZFS machine:
# zfs list NAME USED AVAIL REFER MOUNTPOINT pool1 45.6G 1.71T 128K /pool1 pool1/pub 45.6G 1.71T 45.6G /pool1/pub
I have then removed one of the drives from the pool to simulate the first break down:
# zpool status pool: pool1 state: DEGRADED status: One or more devices could not be used because the label is missing or invalid. Sufficient replicas exist for the pool to continue functioning in a degraded state. action: Replace the device using 'zpool replace'. see: http://zfsonlinux.org/msg/ZFS-8000-4J scan: none requested config: NAME STATE READ WRITE CKSUM pool1 DEGRADED 0 0 0 raidz1-0 DEGRADED 0 0 0 ata-ST31000340NS_9QJ089LF ONLINE 0 0 0 ata-ST31000340NS_9QJ0DVN2 ONLINE 0 0 0 1368416530025724573 UNAVAIL 0 0 0 was /dev/disk/by-id/ata-ST31000340NS_9QJ0ES1V-part1 errors: No known data errors
I am going to begin a replace procedure, and while the resilvering of the new drive is running I am going to disconnect one of the working drives.
# zpool replace -f pool1 ata-ST31000340NS_9QJ0ES1V ata-ST31000340NS_9QJ0EQ1V
Let's check the status:
# zpool status pool: pool1 state: DEGRADED status: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state. action: Wait for the resilver to complete. scan: resilver in progress since Fri May 3 23:23:37 2019 1.19G scanned out of 68.5G at 101M/s, 0h11m to go 404M resilvered, 1.74% done config: NAME STATE READ WRITE CKSUM pool1 DEGRADED 0 0 0 raidz1-0 DEGRADED 0 0 0 ata-ST31000340NS_9QJ089LF ONLINE 0 0 0 ata-ST31000340NS_9QJ0DVN2 ONLINE 0 0 0 replacing-2 DEGRADED 0 0 0 1368416530025724573 UNAVAIL 0 0 0 was /dev/disk/by-id/ata-ST31000340NS_9QJ0ES1V-part1 ata-ST31000340NS_9QJ0EQ1V ONLINE 0 0 0 (resilvering) errors: No known data errors
The resilvering is running and I am now disconnecting a second drive by pulling its power cord. ZFS has not had time to fully resilver the new drive.
# zpool status pool: pool1 state: DEGRADED status: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state. action: Wait for the resilver to complete. scan: resilver in progress since Fri May 3 23:23:37 2019 11.3G scanned out of 68.5G at 138M/s, 0h7m to go 2.38G resilvered, 16.53% done config: NAME STATE READ WRITE CKSUM pool1 DEGRADED 0 0 22.2K raidz1-0 DEGRADED 0 0 44.5K ata-ST31000340NS_9QJ089LF DEGRADED 0 0 0 too many errors ata-ST31000340NS_9QJ0DVN2 UNAVAIL 0 0 0 replacing-2 DEGRADED 0 0 0 1368416530025724573 UNAVAIL 0 0 0 was /dev/disk/by-id/ata-ST31000340NS_9QJ0ES1V-part1 ata-ST31000340NS_9QJ0EQ1V ONLINE 0 0 0 (resilvering) errors: 22768 data errors, use '-v' for a list
The resilvering has run its course, but it could not complete successfully. ZFS not only informs us about the problem, it also tells us which files are now unrecoverable.
# zpool status -v pool: pool1 state: DEGRADED status: One or more devices has experienced an error resulting in data corruption. Applications may be affected. action: Restore the file in question if possible. Otherwise restore the entire pool from backup. see: http://zfsonlinux.org/msg/ZFS-8000-8A scan: resilvered 2.38G in 0h7m with 235402 errors on Fri May 3 23:30:48 2019 config: NAME STATE READ WRITE CKSUM pool1 DEGRADED 0 0 230K raidz1-0 DEGRADED 0 0 461K ata-ST31000340NS_9QJ089LF DEGRADED 0 0 0 too many errors ata-ST31000340NS_9QJ0DVN2 UNAVAIL 0 0 0 replacing-2 DEGRADED 0 0 0 1368416530025724573 UNAVAIL 0 0 0 was /dev/disk/by-id/ata-ST31000340NS_9QJ0ES1V-part1 ata-ST31000340NS_9QJ0EQ1V ONLINE 0 0 0 errors: Permanent errors have been detected in the following files: /pool1/pub/tmp/bar.mkv /pool1/pub/tmp/foo.mkv /pool1/pub/tmp/moo.iso pool1/pub:<0x24> pool1/pub:<0x25> /pool1/pub/tmp/boo.iso /pool1/pub/tmp/zoo.mkv
In this situation trying to run any kind of repair process would not only be futile, but it would also be wrong. The filesystem itself isn't damaged and it doesn't require any kind of repairing.
The question is: what can we do to get as much data back from the broken pool as possible?
Let's run a scrub and see if by any chance we can salvage some files and restore as much of the pool as possible:
# zpool scrub pool1
Let's check:
# zpool status -v pool: pool1 state: DEGRADED status: One or more devices has experienced an error resulting in data corruption. Applications may be affected. action: Restore the file in question if possible. Otherwise restore the entire pool from backup. see: http://zfsonlinux.org/msg/ZFS-8000-8A scan: scrub repaired 0B in 0h1m with 1277 errors on Fri May 3 23:40:12 2019 config: NAME STATE READ WRITE CKSUM pool1 DEGRADED 0 0 235K raidz1-0 DEGRADED 0 0 479K ata-ST31000340NS_9QJ089LF DEGRADED 0 0 0 too many errors ata-ST31000340NS_9QJ0DVN2 UNAVAIL 0 0 0 replacing-2 DEGRADED 0 0 0 1368416530025724573 UNAVAIL 0 0 0 was /dev/disk/by-id/ata-ST31000340NS_9QJ0ES1V-part1 ata-ST31000340NS_9QJ0EQ1V ONLINE 0 0 0 errors: Permanent errors have been detected in the following files: /pool1/pub/tmp/bar.mkv /pool1/pub/tmp/foo.mkv /pool1/pub/tmp/moo.iso pool1/pub:<0x24> pool1/pub:<0x25> /pool1/pub/tmp/boo.iso /pool1/pub/tmp/zoo.mkv
As expected this was a no-go; you cannot scrub a RAID-Z pool with only one original disk and a second one that hasn't been fully resilvered.
Without extensive debugging of the filesystem, the only thing left is to see if we can copy any of the healthy files from the pool to the client. ZFS has already told us which files are corrupted.
As a first attempt I want to see if I can mount the directory on the client and then grab files one at a time:
$ rsync -a --progress --stats mnt/testbox/pub/tmp/ tmp3/ sending incremental file list ./ 1.pdf 18,576,345 100% 109.84MB/s 0:00:00 (xfr#1, to-chk=7/9) 2.pdf 30,255,102 100% 67.89MB/s 0:00:00 (xfr#2, to-chk=6/9) 3.pdf 22,016,195 100% 33.92MB/s 0:00:00 (xfr#3, to-chk=5/9)
The file transfer halted at "3.pdf".
I then tried copying the files one at a time, but I could not get any file except the three PDF files - just as ZFS had already told me.
I got the following error on the client:
Cannot read source file. Bad file descriptor.
So these are the files that I managed to salvage from my broken RAID-5 pool:
$ ls -gG total 69196 -rwxr-xr-x 1 18576345 Apr 21 09:08 1.pdf -rwxr-xr-x 1 30255102 Apr 21 09:08 2.pdf -rwxr-xr-x 1 22016195 Apr 21 09:08 3.pdf
This means that I have reached a point where my RAID-Z pool has been destroyed and the files I am able to restore are very limited. This isn't a surprise, as ZFS is extremely good at spreading data and parity evenly across the drives in a RAID-Z. If you lose two drives in a RAID-Z you almost always lose the entire pool.
Had the resilvering process managed to run for a longer time before the second drive "failed", perhaps I would have been able to salvage more files, but there really isn't anything more I can do now.
In my humble opinion a RAID-Z2 (RAID-6) is the minimum for very important files, but RAID-5 is still extremely useful as long as you remember to always keep backups of your important data, no matter what RAID setup you're using. A RAID setup is never a substitute for backup!
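For reference, creating a double-parity pool only differs in the vdev keyword passed to zpool create. A sketch with made-up device names, not the disks used in these tests:

```shell
# Hypothetical device names; with raidz2 (double parity) any two
# of the member disks can fail without losing the pool.
zpool create pool2 raidz2 \
    /dev/disk/by-id/disk-a /dev/disk/by-id/disk-b \
    /dev/disk/by-id/disk-c /dev/disk/by-id/disk-d
```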
Alright, time to do some testing on Btrfs.
Btrfs RAID-5
According to the Btrfs wiki:
The parity RAID feature is mostly implemented, but has some problems in the case of power failure (or other unclean shutdown) which lead to damaged data. It is recommended that parity RAID be used only for testing purposes.
Let's setup a Btrfs RAID-5 system:
# mkfs.btrfs -f -m raid5 -d raid5 /dev/disk/by-id/ata-ST31000340NS_9QJ089LF /dev/disk/by-id/ata-ST31000340NS_9QJ0EQ1V /dev/disk/by-id/ata-ST31000340NS_9QJ0F2YQ btrfs-progs v4.20.2 See http://btrfs.wiki.kernel.org for more information. Label: (null) UUID: 520d615b-4151-4036-962a-ccc202e1f76c Node size: 16384 Sector size: 4096 Filesystem size: 2.73TiB Block group profiles: Data: RAID5 2.00GiB Metadata: RAID5 2.00GiB System: RAID5 16.00MiB SSD detected: no Incompat features: extref, raid56, skinny-metadata Number of devices: 3 Devices: ID SIZE PATH 1 931.51GiB /dev/disk/by-id/ata-ST31000340NS_9QJ089LF 2 931.51GiB /dev/disk/by-id/ata-ST31000340NS_9QJ0EQ1V 3 931.51GiB /dev/disk/by-id/ata-ST31000340NS_9QJ0F2YQ
Then enable lzo compression and mount the pool:
# mount -o noatime,compress=lzo /dev/disk/by-id/ata-ST31000340NS_9QJ089LF /pub/ # btrfs filesystem show -d Label: none uuid: 520d615b-4151-4036-962a-ccc202e1f76c Total devices 3 FS bytes used 128.00KiB devid 1 size 931.51GiB used 2.01GiB path /dev/sdc devid 2 size 931.51GiB used 2.01GiB path /dev/sdb devid 3 size 931.51GiB used 2.01GiB path /dev/sdd # btrfs device stats /pub/ [/dev/sdc].write_io_errs 0 [/dev/sdc].read_io_errs 0 [/dev/sdc].flush_io_errs 0 [/dev/sdc].corruption_errs 0 [/dev/sdc].generation_errs 0 [/dev/sdb].write_io_errs 0 [/dev/sdb].read_io_errs 0 [/dev/sdb].flush_io_errs 0 [/dev/sdb].corruption_errs 0 [/dev/sdb].generation_errs 0 [/dev/sdd].write_io_errs 0 [/dev/sdd].read_io_errs 0 [/dev/sdd].flush_io_errs 0 [/dev/sdd].corruption_errs 0 [/dev/sdd].generation_errs 0
Time to transfer the files from the client using rsync
:
$ rsync -a --progress --stats tmp/ mnt/testbox/pub/tmp/ 1.pdf 18,576,345 100% 165.28MB/s 0:00:00 (xfr#1, to-chk=6/8) 2.pdf 30,255,102 100% 84.86MB/s 0:00:00 (xfr#2, to-chk=5/8) 3.pdf 22,016,195 100% 31.81MB/s 0:00:00 (xfr#3, to-chk=4/8) bar.mkv 35,456,180,485 100% 107.72MB/s 0:05:13 (xfr#4, to-chk=3/8) boo.iso 625,338,368 100% 21.36MB/s 0:00:27 (xfr#5, to-chk=2/8) foo.mkv 1,548,841,922 100% 131.10MB/s 0:00:11 (xfr#6, to-chk=1/8) moo.iso 415,633,408 100% 24.38MB/s 0:00:16 (xfr#7, to-chk=0/8) Number of files: 8 (reg: 7, dir: 1) Number of created files: 8 (reg: 7, dir: 1) Number of deleted files: 0 Number of regular files transferred: 7 Total file size: 38,116,841,825 bytes Total transferred file size: 38,116,841,825 bytes Literal data: 38,116,841,825 bytes Matched data: 0 bytes File list size: 0 File list generation time: 0.001 seconds File list transfer time: 0.000 seconds Total bytes sent: 38,126,148,151 Total bytes received: 202 sent 38,126,148,151 bytes received 202 bytes 102,078,041.11 bytes/sec total size is 38,116,841,825 speedup is 1.00
Compared to the ZFS RAID-Z1 transfer:
sent 38,126,148,150 bytes received 202 bytes 106,945,717.68 bytes/sec
On the Btrfs machine I receive a clear warning about some of the missing functionality of the RAID5/6 capability which is also described on the Btrfs wiki status:
The write hole is the last missing part, preliminary patches have been posted but needed to be reworked. The parity not checksummed note has been removed.
# btrfs filesystem usage /pub WARNING: RAID56 detected, not implemented WARNING: RAID56 detected, not implemented WARNING: RAID56 detected, not implemented Overall: Device size: 2.73TiB Device allocated: 0.00B Device unallocated: 2.73TiB Device missing: 0.00B Used: 0.00B Free (estimated): 0.00B (min: 8.00EiB) Data ratio: 0.00 Metadata ratio: 0.00 Global reserve: 40.25MiB (used: 0.00B) Data,RAID5: Size:36.00GiB, Used:35.51GiB /dev/sdb 18.00GiB /dev/sdc 18.00GiB /dev/sdd 18.00GiB Metadata,RAID5: Size:2.00GiB, Used:40.44MiB /dev/sdb 1.00GiB /dev/sdc 1.00GiB /dev/sdd 1.00GiB System,RAID5: Size:16.00MiB, Used:16.00KiB /dev/sdb 8.00MiB /dev/sdc 8.00MiB /dev/sdd 8.00MiB Unallocated: /dev/sdb 912.50GiB /dev/sdc 912.50GiB /dev/sdd 912.50GiB
Btrfs - Power outage
I have then again added the "zoo.mkv" file to the files on the client and will begin the rsync
transfer and pull the power cord to the Btrfs machine at about 50% of the transfer.
$ rsync -a --progress --stats tmp/ mnt/testbox/pub/tmp/ sending incremental file list zoo.mkv 5,887,590,400 54% 71.49kB/s 19:20:49 ^C
The power cord has been pulled. I have aborted the file transfer on the client and the Btrfs machine has been powered back up again:
# btrfs filesystem show -d Label: none uuid: 520d615b-4151-4036-962a-ccc202e1f76c Total devices 3 FS bytes used 35.55GiB devid 1 size 931.51GiB used 19.01GiB path /dev/sdc devid 2 size 931.51GiB used 19.01GiB path /dev/sdb devid 3 size 931.51GiB used 19.01GiB path /dev/sdd # btrfs device stats /pub [/dev/sdc].write_io_errs 0 [/dev/sdc].read_io_errs 0 [/dev/sdc].flush_io_errs 0 [/dev/sdc].corruption_errs 0 [/dev/sdc].generation_errs 0 [/dev/sdb].write_io_errs 0 [/dev/sdb].read_io_errs 0 [/dev/sdb].flush_io_errs 0 [/dev/sdb].corruption_errs 0 [/dev/sdb].generation_errs 0 [/dev/sdd].write_io_errs 0 [/dev/sdd].read_io_errs 0 [/dev/sdd].flush_io_errs 0 [/dev/sdd].corruption_errs 0 [/dev/sdd].generation_errs 0
Btrfs is also a transactional filesystem and the pool is back up. There are no errors and everything is mountable from the client. As with the ZFS test we have only lost the file that was being transferred.
Btrfs - Drive failure
Time to simulate a drive failure. I will remove the same drive as with ZFS, then afterwards attach a new drive and try to restore the pool.
# btrfs filesystem show -d warning, device 1 is missing checksum verify failed on 83820544 found C780E0CF wanted 23635D79 bad tree block 83820544, bytenr mismatch, want=83820544, have=65536 Label: none uuid: 520d615b-4151-4036-962a-ccc202e1f76c Total devices 3 FS bytes used 35.55GiB devid 2 size 931.51GiB used 19.01GiB path /dev/sdb devid 3 size 931.51GiB used 19.01GiB path /dev/sdd *** Some devices missing
Btrfs is informing us about the missing disk. Let's locate the new one and replace the old with it:
$ ls -gG /dev/disk/by-id ata-ST31000340NS_9QJ0DVN2 -> ../../sdc
I need to mount the pool in a degraded state with one of the working disks:
# mount -o noatime,compress=lzo,degraded /dev/disk/by-id/ata-ST31000340NS_9QJ0F2YQ /pub/
Then because the "broken" device has been removed I have to use the "devid" parameter format in order to replace the device. This is one place in the Btrfs documentation that could benefit from an example.
The "devid" is the missing device ID from the btrfs filesystem show -d
command, not from the "by-id" or "uuid". Also since the new disk already contains a filesystem from the previous test I need to use the -f
option to force the command:
So the command basically is: btrfs replace start old_device new_device mount_point
where old_device is the "devid" number Btrfs has supplied us with:
# btrfs replace start -f 1 /dev/disk/by-id/ata-ST31000340NS_9QJ0DVN2 /pub
We can then check the status of the replacement:
# btrfs replace status -1 /pub 0.4% done, 0 write errs, 0 uncorr. read errs # iostat -dh /dev/disk/by-id/ata-ST31000340NS_9QJ0DVN2 Linux 5.0.9-arch1-1-ARCH (testbox) 04/25/2019 _x86_64_ (2 CPU) tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd Device 148.08 5.1k 11.9M 0.0k 5.1M 11.7G 0.0k sdc
ZFS completed the restoration in about 3 minutes, while Btrfs took a little more than twice as long:
# btrfs replace status -1 /pub Started on 25.Apr 01:39:20, finished on 25.Apr 01:46:59, 0 write errs, 0 uncorr. read errs
The pool is back up again with no missing files or any other problems:
# btrfs filesystem show -d Label: none uuid: 520d615b-4151-4036-962a-ccc202e1f76c Total devices 3 FS bytes used 35.55GiB devid 1 size 931.51GiB used 19.00GiB path /dev/sdc devid 2 size 931.51GiB used 20.03GiB path /dev/sdb devid 3 size 931.51GiB used 20.03GiB path /dev/sdd # btrfs filesystem usage /pub WARNING: RAID56 detected, not implemented WARNING: RAID56 detected, not implemented WARNING: RAID56 detected, not implemented Overall: Device size: 2.73TiB Device allocated: 0.00B Device unallocated: 2.73TiB Device missing: 0.00B Used: 0.00B Free (estimated): 0.00B (min: 8.00EiB) Data ratio: 0.00 Metadata ratio: 0.00 Global reserve: 40.25MiB (used: 0.00B) Data,RAID5: Size:36.00GiB, Used:35.51GiB /dev/sdb 18.00GiB /dev/sdc 18.00GiB /dev/sdd 18.00GiB Metadata,RAID5: Size:3.00GiB, Used:40.44MiB /dev/sdb 2.00GiB /dev/sdc 1.00GiB /dev/sdd 2.00GiB System,RAID5: Size:32.00MiB, Used:16.00KiB /dev/sdb 32.00MiB /dev/sdd 32.00MiB Unallocated: /dev/sdb 911.48GiB /dev/sdc 912.51GiB /dev/sdd 911.48GiB # btrfs device stats /pub [/dev/sdc].write_io_errs 0 [/dev/sdc].read_io_errs 0 [/dev/sdc].flush_io_errs 0 [/dev/sdc].corruption_errs 0 [/dev/sdc].generation_errs 0 [/dev/sdb].write_io_errs 0 [/dev/sdb].read_io_errs 0 [/dev/sdb].flush_io_errs 0 [/dev/sdb].corruption_errs 0 [/dev/sdb].generation_errs 0 [/dev/sdd].write_io_errs 0 [/dev/sdd].read_io_errs 0 [/dev/sdd].flush_io_errs 0 [/dev/sdd].corruption_errs 0 [/dev/sdd].generation_errs 0
In the above I noticed that the data is not spread as evenly across the devices as it was before the simulated failure.
Before:
devid 1 size 931.51GiB used 19.01GiB path /dev/sdc devid 2 size 931.51GiB used 19.01GiB path /dev/sdb devid 3 size 931.51GiB used 19.01GiB path /dev/sdd
After:
devid 1 size 931.51GiB used 19.00GiB path /dev/sdc devid 2 size 931.51GiB used 20.03GiB path /dev/sdb devid 3 size 931.51GiB used 20.03GiB path /dev/sdd
But the usage
command shows that this is only due to metadata:
# btrfs filesystem usage /pub ... Metadata,RAID5: Size:3.00GiB, Used:51.58MiB /dev/sdb 2.00GiB /dev/sdc 1.00GiB /dev/sdd 2.00GiB
Let's perform a scrub now and validate that everything is alright:
# btrfs scrub start /pub/ # btrfs scrub status -d /pub/ scrub status for 520d615b-4151-4036-962a-ccc202e1f76c scrub device /dev/sdc (id 1) history scrub started at Thu Apr 25 04:04:57 2019 and finished after 00:28:11 total bytes scrubbed: 15.23GiB with 0 errors scrub device /dev/sdb (id 2) history scrub started at Thu Apr 25 04:04:57 2019 and finished after 00:27:31 total bytes scrubbed: 15.23GiB with 0 errors scrub device /dev/sdd (id 3) history scrub started at Thu Apr 25 04:04:57 2019 and finished after 00:27:32 total bytes scrubbed: 15.23GiB with 0 errors
So far no problems.
Btrfs - Drive failure during file transfer
Now it's time to remove a drive during an active file transfer:
$ rsync -a --progress --stats tmp/ mnt/testbox/pub/tmp/ sending incremental file list zoo.mkv 10,867,033,488 100% 119.28MB/s 0:01:26 (xfr#3, to-chk=0/9) Number of files: 9 (reg: 8, dir: 1) Number of created files: 1 (reg: 1) Number of deleted files: 0 Number of regular files transferred: 3 Total file size: 48,983,875,313 bytes Total transferred file size: 10,919,304,785 bytes Literal data: 10,919,304,785 bytes Matched data: 0 bytes File list size: 0 File list generation time: 0.001 seconds File list transfer time: 0.000 seconds Total bytes sent: 10,921,970,963 Total bytes received: 76 sent 10,921,970,963 bytes received 76 bytes 96,228,819.73 bytes/sec total size is 48,983,875,313 speedup is 4.48
Btrfs reacted exactly the same way ZFS did. It momentarily halted the file transfer for about a second, then resumed the transfer without the client being able to notice anything other than the momentary drop in the file transfer speed.
On the Btrfs machine the pool has changed the state to a missing device:
# btrfs filesystem show -d /pub Label: none uuid: e4f04b17-c62b-4847-beeb-753bbb64c79a Total devices 3 FS bytes used 36.99GiB devid 1 size 931.51GiB used 21.01GiB path /dev/sdc devid 2 size 931.51GiB used 21.01GiB path /dev/sdd *** Some devices missing
# btrfs filesystem usage /pub WARNING: RAID56 detected, not implemented WARNING: RAID56 detected, not implemented WARNING: RAID56 detected, not implemented Overall: Device size: 2.73TiB Device allocated: 0.00B Device unallocated: 2.73TiB Device missing: 931.51GiB Used: 0.00B Free (estimated): 0.00B (min: 8.00EiB) Data ratio: 0.00 Metadata ratio: 0.00 Global reserve: 51.75MiB (used: 0.00B) Data,RAID5: Size:48.00GiB, Used:45.63GiB /dev/sdb 24.00GiB /dev/sdc 24.00GiB /dev/sdd 24.00GiB Metadata,RAID5: Size:2.00GiB, Used:51.92MiB /dev/sdb 1.00GiB /dev/sdc 1.00GiB /dev/sdd 1.00GiB System,RAID5: Size:16.00MiB, Used:16.00KiB /dev/sdb 8.00MiB /dev/sdc 8.00MiB /dev/sdd 8.00MiB Unallocated: /dev/sdb 906.50GiB /dev/sdc 906.50GiB /dev/sdd 906.50GiB
As with ZFS I powered down the machine in order to safely reattach the drive and then rebooted.
# btrfs filesystem show -d Label: none uuid: e4f04b17-c62b-4847-beeb-753bbb64c79a Total devices 3 FS bytes used 35.38GiB devid 1 size 931.51GiB used 25.01GiB path /dev/sdc devid 2 size 931.51GiB used 25.01GiB path /dev/sdd devid 3 size 931.51GiB used 19.01GiB path /dev/sdb
The show
command reveals that the pool is not in balance. In order to get more information I need to mount the pool and then use the device stats
command.
The device stats
keep a persistent record of several error classes related to doing IO. The current values are printed at mount time and updated during filesystem lifetime or from a scrub:
# btrfs device stats /pub/ [/dev/sdc].write_io_errs 0 [/dev/sdc].read_io_errs 0 [/dev/sdc].flush_io_errs 0 [/dev/sdc].corruption_errs 0 [/dev/sdc].generation_errs 0 [/dev/sdd].write_io_errs 0 [/dev/sdd].read_io_errs 0 [/dev/sdd].flush_io_errs 0 [/dev/sdd].corruption_errs 0 [/dev/sdd].generation_errs 0 [/dev/sdb].write_io_errs 16 [/dev/sdb].read_io_errs 1 [/dev/sdb].flush_io_errs 5 [/dev/sdb].corruption_errs 0 [/dev/sdb].generation_errs 0
The status report clearly shows write errors.
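These counters are easy to check mechanically as well. A sketch of the idea, assuming the "[device].counter value" output layout shown above; the script name is made up:

```shell
#!/bin/sh
# stats-check.sh (hypothetical name): read `btrfs device stats <mount>`
# output on stdin, print every non-zero counter and exit non-zero if
# any was found.
awk 'NF == 2 && $2 != 0 {
    bad = 1
    print "non-zero counter: " $1 " " $2
}
END { exit bad }'
```

Used as `btrfs device stats /pub/ | sh stats-check.sh`, the non-zero exit status can then trigger whatever alerting is already in place.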
In the current situation the correct approach is to do a scrub:
# btrfs scrub start /pub/ scrub started on /pub/, fsid e4f04b17-c62b-4847-beeb-753bbb64c79a (pid=583)
As with ZFS, Btrfs is now running through the data and the checksums and is trying to repair the data.
After the scrubbing is done Btrfs tells us that it has repaired quite a lot of data, all with 0 uncorrectable errors.
# btrfs scrub status -d /pub/ scrub status for e4f04b17-c62b-4847-beeb-753bbb64c79a scrub device /dev/sdc (id 1) history scrub started at Fri Apr 26 03:18:57 2019 and finished after 00:49:34 total bytes scrubbed: 15.23GiB with 25452 errors error details: csum=25452 corrected errors: 25452, uncorrectable errors: 0, unverified errors: 0 scrub device /dev/sdd (id 2) history scrub started at Fri Apr 26 03:18:57 2019 and finished after 00:47:19 total bytes scrubbed: 15.23GiB with 27768 errors error details: csum=27768 corrected errors: 27768, uncorrectable errors: 0, unverified errors: 0 scrub device /dev/sdb (id 3) history scrub started at Fri Apr 26 03:18:57 2019 and finished after 00:49:34 total bytes scrubbed: 15.23GiB with 2316 errors error details: csum=2316 corrected errors: 2316, uncorrectable errors: 0, unverified errors: 0
Btrfs has managed to repair everything without any errors but it still keeps the logs of the errors.
What is noticeable is that ZFS finished the scrubbing and repair in just about 6 minutes while Btrfs took about 50 minutes.
This is because Btrfs has also brought the pool into balance during the scrubbing and Btrfs is famous for being very slow at re-balancing drives:
# btrfs filesystem show -d Label: none uuid: e4f04b17-c62b-4847-beeb-753bbb64c79a Total devices 3 FS bytes used 45.68GiB devid 1 size 931.51GiB used 25.56GiB path /dev/sdc devid 2 size 931.51GiB used 25.56GiB path /dev/sdd devid 3 size 931.51GiB used 25.56GiB path /dev/sdb
Again it is up to the system administrator to decide what he or she wants to do. And this is important, because even though Btrfs has managed to restore the pool we might still be dealing with an unhealthy device.
Can we clear the log? Or do we need to replace the drive anyway? Perhaps S.M.A.R.T has warned us that the drive is currently working, but it is experiencing occasional issues and soon needs to be fully replaced.
# btrfs device stats -c /pub/ [/dev/sdc].write_io_errs 0 [/dev/sdc].read_io_errs 0 [/dev/sdc].flush_io_errs 0 [/dev/sdc].corruption_errs 0 [/dev/sdc].generation_errs 0 [/dev/sdd].write_io_errs 0 [/dev/sdd].read_io_errs 0 [/dev/sdd].flush_io_errs 0 [/dev/sdd].corruption_errs 0 [/dev/sdd].generation_errs 0 [/dev/sdb].write_io_errs 16 [/dev/sdb].read_io_errs 1 [/dev/sdb].flush_io_errs 5 [/dev/sdb].corruption_errs 55536 [/dev/sdb].generation_errs 0
The result of the scrubbing showed zero uncorrectable errors and I know the drive is working fine so I'll just clear the log with the -z
option:
# btrfs device stats -z /pub/ [/dev/sdc].write_io_errs 0 [/dev/sdc].read_io_errs 0 [/dev/sdc].flush_io_errs 0 [/dev/sdc].corruption_errs 0 [/dev/sdc].generation_errs 0 [/dev/sdd].write_io_errs 0 [/dev/sdd].read_io_errs 0 [/dev/sdd].flush_io_errs 0 [/dev/sdd].corruption_errs 0 [/dev/sdd].generation_errs 0 [/dev/sdb].write_io_errs 0 [/dev/sdb].read_io_errs 0 [/dev/sdb].flush_io_errs 0 [/dev/sdb].corruption_errs 0 [/dev/sdb].generation_errs 0
Btrfs - Data corruption during file transfer
Now I want to simulate the same disk corruption in the middle of a file transfer from the client as I did with ZFS.
I have removed the "zoo.mkv" file and while rsync
is running I will use dd
a couple of times on the Btrfs machine on one of the drives:
# dd if=/dev/urandom of=/dev/disk/by-id/ata-ST31000340NS_9QJ0ET8D seek=100000 count=1000 bs=1k
The device stats command did not show any problems:
# btrfs device stats -c /pub [/dev/sdc].write_io_errs 0 [/dev/sdc].read_io_errs 0 [/dev/sdc].flush_io_errs 0 [/dev/sdc].corruption_errs 0 [/dev/sdc].generation_errs 0 [/dev/sdd].write_io_errs 0 [/dev/sdd].read_io_errs 0 [/dev/sdd].flush_io_errs 0 [/dev/sdd].corruption_errs 0 [/dev/sdd].generation_errs 0 [/dev/sdb].write_io_errs 0 [/dev/sdb].read_io_errs 0 [/dev/sdb].flush_io_errs 0 [/dev/sdb].corruption_errs 0 [/dev/sdb].generation_errs 0
However, both dmesg and the log reveal something:
[ 1932.091249] BTRFS error (device sdc): csum mismatch on free space cache [ 1932.091262] BTRFS warning (device sdc): failed to load free space cache for block group 42988470272, rebuilding it now [ 1932.334063] BTRFS error (device sdc): csum mismatch on free space cache [ 1932.334076] BTRFS warning (device sdc): failed to load free space cache for block group 47283437568, rebuilding it now [ 2005.178214] BTRFS error (device sdc): space cache generation (17) does not match inode (19) [ 2005.178222] BTRFS warning (device sdc): failed to load free space cache for block group 38693502976, rebuilding it now
Btrfs did detect the problem and automatically fixed it, but I had expected this kind of error to show up in the device stats
result, perhaps as a corruption error count.
Btrfs - The dd mistake
Time to see what happens if I mistakenly run the dd
command on one of the drives during a file transfer from the client.
As with the ZFS test I have deleted all the files, restarted rsync
:
# dd if=/dev/urandom of=/dev/sdb bs=1k ^C232089+0 records in 232089+0 records out 237659136 bytes (238 MB, 227 MiB) copied, 27.3843 s, 8.7 MB/s
Again the device stats command didn't show any problems:
# btrfs device stats -c /pub/ [/dev/sdc].write_io_errs 0 [/dev/sdc].read_io_errs 0 [/dev/sdc].flush_io_errs 0 [/dev/sdc].corruption_errs 0 [/dev/sdc].generation_errs 0 [/dev/sdb].write_io_errs 0 [/dev/sdb].read_io_errs 0 [/dev/sdb].flush_io_errs 0 [/dev/sdb].corruption_errs 0 [/dev/sdb].generation_errs 0 [/dev/sdd].write_io_errs 0 [/dev/sdd].read_io_errs 0 [/dev/sdd].flush_io_errs 0 [/dev/sdd].corruption_errs 0 [/dev/sdd].generation_errs 0
But dmesg did, after about a minute:
[ 867.808813] BTRFS error (device sdc): bad tree block start, want 53133312 have 17920600362259148199 [ 867.848391] BTRFS info (device sdc): read error corrected: ino 0 off 53133312 (dev /dev/sdb sector 32480) [ 867.886255] BTRFS info (device sdc): read error corrected: ino 0 off 53137408 (dev /dev/sdb sector 32488) [ 867.893746] BTRFS info (device sdc): read error corrected: ino 0 off 53141504 (dev /dev/sdb sector 32496) [ 867.903079] BTRFS info (device sdc): read error corrected: ino 0 off 53145600 (dev /dev/sdb sector 32504) [ 867.928986] BTRFS error (device sdc): bad tree block start, want 53100544 have 125614526405871379 [ 867.946912] BTRFS info (device sdc): read error corrected: ino 0 off 53100544 (dev /dev/sdb sector 32416) [ 867.948135] BTRFS info (device sdc): read error corrected: ino 0 off 53104640 (dev /dev/sdb sector 32424) [ 867.948793] BTRFS info (device sdc): read error corrected: ino 0 off 53108736 (dev /dev/sdb sector 32432) [ 867.952210] BTRFS info (device sdc): read error corrected: ino 0 off 53112832 (dev /dev/sdb sector 32440) [ 868.128686] BTRFS error (device sdc): bad tree block start, want 43614208 have 15420301482281005013 [ 868.130861] BTRFS error (device sdc): bad tree block start, want 43614208 have 15420301482281005013 [ 868.196118] BTRFS error (device sdc): bad tree block start, want 43614208 have 15420301482281005013 [ 868.296277] BTRFS error (device sdc): bad tree block start, want 43614208 have 15420301482281005013 [ 868.333942] BTRFS info (device sdc): read error corrected: ino 0 off 43614208 (dev /dev/sdb sector 23104) [ 868.337820] BTRFS info (device sdc): read error corrected: ino 0 off 43618304 (dev /dev/sdb sector 23112) [ 868.353572] BTRFS error (device sdc): bad tree block start, want 43630592 have 10676903441545527670 [ 868.378400] BTRFS error (device sdc): bad tree block start, want 43597824 have 485580186567037103 [ 868.531339] BTRFS error (device sdc): bad tree block start, want 46039040 have 1852668134064264900 [ 868.569488] BTRFS error (device sdc): bad tree block start, want 46055424 have 418370625237599952
On the client, as with ZFS, there is nothing noticeable going on during file transfer.
Time to run a scrub in order to correct the errors:
# btrfs scrub start /pub/ scrub started on /pub/, fsid 045b8eb9-267a-479b-92af-a996d9a27d12 (pid=468)
# btrfs scrub status -d /pub/ scrub status for 045b8eb9-267a-479b-92af-a996d9a27d12 scrub device /dev/disk/by-id/ata-ST31000340NS_9QJ089LF (id 1) status scrub started at Tue Apr 30 22:58:47 2019, running for 00:01:00 total bytes scrubbed: 582.61MiB with 9 errors error details: csum=9 corrected errors: 9, uncorrectable errors: 0, unverified errors: 0 scrub device /dev/sdb (id 2) status scrub started at Tue Apr 30 22:58:47 2019, running for 00:01:00 total bytes scrubbed: 543.27MiB with 639 errors error details: csum=639 corrected errors: 639, uncorrectable errors: 0, unverified errors: 0 scrub device /dev/sdd (id 3) status scrub started at Tue Apr 30 22:58:47 2019, running for 00:01:00 total bytes scrubbed: 480.40MiB with 2 errors error details: csum=2 corrected errors: 2, uncorrectable errors: 0, unverified errors: 0 WARNING: errors detected during scrubbing, corrected
Btrfs has detected the errors and fixed them:
scrub status for 045b8eb9-267a-479b-92af-a996d9a27d12 scrub device /dev/disk/by-id/ata-ST31000340NS_9QJ089LF (id 1) history scrub started at Tue Apr 30 22:58:47 2019 and finished after 00:25:08 total bytes scrubbed: 15.23GiB with 9 errors error details: csum=9 corrected errors: 9, uncorrectable errors: 0, unverified errors: 0 scrub device /dev/sdb (id 2) history scrub started at Tue Apr 30 22:58:47 2019 and finished after 00:25:06 total bytes scrubbed: 15.23GiB with 639 errors error details: csum=639 corrected errors: 639, uncorrectable errors: 0, unverified errors: 0 scrub device /dev/sdd (id 3) history scrub started at Tue Apr 30 22:58:47 2019 and finished after 00:25:16 total bytes scrubbed: 15.23GiB with 2 errors error details: csum=2 corrected errors: 2, uncorrectable errors: 0, unverified errors: 0
# btrfs device stats -c /pub/ [/dev/sdc].write_io_errs 0 [/dev/sdc].read_io_errs 0 [/dev/sdc].flush_io_errs 0 [/dev/sdc].corruption_errs 0 [/dev/sdc].generation_errs 0 [/dev/sdb].write_io_errs 0 [/dev/sdb].read_io_errs 0 [/dev/sdb].flush_io_errs 0 [/dev/sdb].corruption_errs 650 [/dev/sdb].generation_errs 0 [/dev/sdd].write_io_errs 0 [/dev/sdd].read_io_errs 0 [/dev/sdd].flush_io_errs 0 [/dev/sdd].corruption_errs 0 [/dev/sdd].generation_errs 0
Btrfs handled the problem just as well as ZFS. The only difference was the time it took to do the scrub.
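Note that errors like these are only detected when the affected blocks are actually read or scrubbed, so in practice scrubs should run on a schedule. A minimal sketch, assuming the pool stays mounted at /pub and root's crontab is used (the schedule is just an example):

```shell
# Hypothetical root crontab entry: scrub /pub at 03:00 on the first day
# of each month. -B keeps the scrub in the foreground so cron captures
# its exit status and mails any output to root.
0 3 1 * * /usr/bin/btrfs scrub start -B /pub
```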
Btrfs - The "write hole" issue
Since Btrfs still has warnings about the write hole issue I would like to see if it's possible to recreate the problem in this test.
Parity may be inconsistent after a crash (the "write hole"). The problem arises when a disk failure happens after "an unclean shutdown". These are two distinct failures, and together they break the Btrfs RAID-5 redundancy. If you run a scrub process after "an unclean shutdown" (with no disk failure in between), the data which match their checksums can still be read out, while the mismatched data are lost forever.
These two issues have to exist at the same time:
- An unclean shutdown.
- A disk failure.
So pulling the power cord to the machine during a file transfer and then simulating a disk failure by removing one of the drives should potentially re-create the issue.
I have removed the "zoo.mkv" file from the Btrfs machine. I will pull the power cord during the transfer of that file, then remove a drive and see what happens.
$ rsync -a --progress --stats tmp/ mnt/testbox/pub/tmp/ sending incremental file list zoo.mkv 7,176,814,592 66% 58.80kB/s 17:25:53 ^C
The Btrfs machine has now suffered an unclean shutdown. I have aborted the file transfer on the client and unmounted the Btrfs export. I have then physically changed one of the drives in the Btrfs machine and will now try to do a replacement.
# btrfs filesystem show -d warning, device 2 is missing checksum verify failed on 117506048 found E6CE304B wanted 022D8DFD bad tree block 117506048, bytenr mismatch, want=117506048, have=65536 Couldn't setup extent tree checksum verify failed on 117538816 found 151B2790 wanted F1F89A26 bad tree block 117538816, bytenr mismatch, want=117538816, have=65536 Couldn't setup device tree Label: none uuid: 045b8eb9-267a-479b-92af-a996d9a27d12 Total devices 3 FS bytes used 38.61GiB devid 1 size 931.51GiB used 21.01GiB path /dev/sdc devid 3 size 931.51GiB used 21.01GiB path /dev/sdd *** Some devices missin
I got the same error messages as in the previous test where I simulated a drive failure, except that this time Btrfs is also complaining that it "couldn't setup device tree".
I will now mount the pool in a degraded state and replace the faulty drive to see if we can salvage any data from the pool. The mounting has to be performed with a healthy drive:
# mount -o noatime,compress=lzo,degraded /dev/disk/by-id/ata-ST31000340NS_9QJ0 /pub
This time it is "devid" 2 I need to replace. The new disk is the "9QJ0ET8D" one:
# btrfs replace start -f 2 /dev/disk/by-id/ata-ST31000340NS_9QJ0ET8D /pub/
Let's check the status of the replacement:
# btrfs replace status -1 /pub/ 0.4% done, 0 write errs, 0 uncorr. read errs
Then after a little while:
# btrfs replace status -1 /pub/ Started on 30.Apr 23:58:34, finished on 1.May 00:06:39, 0 write errs, 0 uncorr. read errs
# btrfs filesystem show -d Label: none uuid: 045b8eb9-267a-479b-92af-a996d9a27d12 Total devices 3 FS bytes used 38.61GiB devid 1 size 931.51GiB used 22.04GiB path /dev/sdc devid 2 size 931.51GiB used 21.00GiB path /dev/sdb devid 3 size 931.51GiB used 22.04GiB path /dev/sdd
# btrfs device stats -c /pub [/dev/sdc].write_io_errs 0 [/dev/sdc].read_io_errs 0 [/dev/sdc].flush_io_errs 0 [/dev/sdc].corruption_errs 0 [/dev/sdc].generation_errs 0 [/dev/sdb].write_io_errs 0 [/dev/sdb].read_io_errs 0 [/dev/sdb].flush_io_errs 0 [/dev/sdb].corruption_errs 0 [/dev/sdb].generation_errs 0 [/dev/sdd].write_io_errs 0 [/dev/sdd].read_io_errs 0 [/dev/sdd].flush_io_errs 0 [/dev/sdd].corruption_errs 0 [/dev/sdd].generation_errs 0
# ls -gG /pub/tmp/ total 37223496 -rwxrw-r-- 1 18576345 Apr 21 09:08 1.pdf -rwxrw-r-- 1 30255102 Apr 21 09:08 2.pdf -rwxrw-r-- 1 22016195 Apr 21 09:08 3.pdf -rwxrw-r-- 1 35456180485 Apr 21 07:58 bar.mkv -rwxrw-r-- 1 625338368 Mar 5 2018 boo.iso -rwxrw-r-- 1 1548841922 Apr 15 23:50 foo.mkv -rwxrw-r-- 1 415633408 Mar 5 2018 moo.iso
Everything has been restored nicely and all three drives are performing well. I didn't lose any files or suffer any parity issues that made the replacement a problem.
I have repeated the above test with the same result more than once.
Btrfs - A second drive failure during a replacement
Now I want to see what's going to happen with Btrfs when I lose a second drive during a replacement procedure.
I have removed one of the drives and am mounting the Btrfs pool in a degraded state in order to begin a replacement:
# btrfs filesystem show -d Label: none uuid: 045b8eb9-267a-479b-92af-a996d9a27d12 Total devices 3 FS bytes used 38.61GiB devid 1 size 931.51GiB used 22.03GiB path /dev/sdc devid 3 size 931.51GiB used 22.03GiB path /dev/sdd *** Some devices missing
# mount -o noatime,compress=lzo,degraded /dev/disk/by-id/ata-ST31000340NS_9QJ089LF /pub
As I did with ZFS, while the replacement procedure is running I will disconnect one of the working drives.
# btrfs replace start -f 2 /dev/disk/by-id/ata-ST31000340NS_9QJ0ET8D /pub/
Let's check the status:
# btrfs replace status -1 /pub/ 0.1% done, 0 write errs, 0 uncorr. read errs
I now disconnect a second drive by removing the power cord for the drive:
# btrfs replace status -1 /pub/ Started on 3.May 21:03:21, canceled on 3.May 21:04:12 at 0.0%, 0 write errs, 0 uncorr. read errs
Btrfs cancelled the replacement when the second drive went offline.
# ls -gG /pub/tmp/ ls: cannot access '/pub/tmp/boo.iso': Input/output error ls: cannot access '/pub/tmp/foo.mkv': Input/output error ls: cannot access '/pub/tmp/moo.iso': Input/output error total 34694376 -rwxrw-r-- 1 18576345 Apr 21 09:08 1.pdf -rwxrw-r-- 1 30255102 Apr 21 09:08 2.pdf -rwxrw-r-- 1 22016195 Apr 21 09:08 3.pdf -rwxrw-r-- 1 35456180485 Apr 21 07:58 bar.mkv -????????? ? ? ? boo.iso -????????? ? ? ? foo.mkv -????????? ? ? ? moo.iso
We clearly have a problem.
I have attached a new drive and the Btrfs machine now only has one healthy drive in the pool and two new drives of which one has only been partly replaced.
# umount /pub # btrfs filesystem show -d warning, device 3 is missing Label: none uuid: 045b8eb9-267a-479b-92af-a996d9a27d12 Total devices 3 FS bytes used 38.61GiB devid 1 size 931.51GiB used 22.03GiB path /dev/sdc *** Some devices missing
# mount -o noatime,compress=lzo,degraded /dev/disk/by-id/ata-ST31000340NS_9QJ089LF /pub/ # btrfs filesystem show -d warning, device 3 is missing checksum verify failed on 119832576 found B67B4ABD wanted A302A7B3 checksum verify failed on 119832576 found B67B4ABD wanted A302A7B3 bad tree block 119832576, bytenr mismatch, want=119832576, have=5117397648563945276 Label: none uuid: 045b8eb9-267a-479b-92af-a996d9a27d12 Total devices 3 FS bytes used 38.61GiB devid 1 size 931.51GiB used 22.03GiB path /dev/sdc devid 2 size 931.51GiB used 21.01GiB path /dev/sdd *** Some devices missing
The drive that went through the partial replacement is at least recognized as belonging to the pool.
Now, in this situation trying to run any kind of repair process would not only be futile, it would also be very wrong: the filesystem isn't damaged and it doesn't require any kind of repair.
Again I will try to replace the third disk and see if I maybe have enough data and metadata lying around to actually restore the pool without losing any data (as with ZFS, this is a very long shot):
Let's locate the new disk:
# ls -l /dev/disk/by-id/ ata-ST31000340NS_9QJ089LF -> ../../sdc ata-ST31000340NS_9QJ0DVN2 -> ../../sdd ata-ST31000340NS_9QJ0ES1V -> ../../sdb
"devid" 3 needs to be replaced with the "9QJ0ES1V" one:
# btrfs replace start -f 3 /dev/disk/by-id/ata-ST31000340NS_9QJ0ES1V /pub/
No errors. Let's check the status:
# btrfs replace status -1 /pub/ Started on 3.May 21:03:21, suspended on 1.May 00:06:39 at 0.2%, 0 write errs, 0 uncorr. read errs
Suspended!
Let's see what dmesg says:
[ 509.084144] BTRFS info (device sdc): use lzo compression, level 0 [ 509.084147] BTRFS info (device sdc): allowing degraded mounts [ 509.084148] BTRFS info (device sdc): disk space caching is enabled [ 509.084150] BTRFS info (device sdc): has skinny extents [ 509.107081] BTRFS warning (device sdc): devid 3 uuid 9078bc78-a5ba-4178-96ca-53fb2e29b62c is missing [ 509.167206] BTRFS info (device sdc): cannot continue dev_replace, tgtdev is missing [ 509.167208] BTRFS info (device sdc): you may cancel the operation after 'mount -o degraded'
So a replacement is not possible.
On ZFS we get much better information from zpool status -v, about both the replacement status and the specific files that cannot be restored.
Let's run a scrub and see if by any chance we can salvage some files and restore as much of the pool as possible:
# btrfs scrub start /pub/ scrub started on /pub/, fsid 045b8eb9-267a-479b-92af-a996d9a27d12 (pid=497)
Let's check:
# btrfs scrub status -d /pub/ scrub status for 045b8eb9-267a-479b-92af-a996d9a27d12 scrub device /dev/sdc (id 1) history scrub started at Fri May 3 21:18:18 2019 and was aborted after 00:00:00 total bytes scrubbed: 0.00B with 0 errors scrub device /dev/sdd (id 2) history scrub started at Fri May 3 21:18:18 2019 and was aborted after 00:00:00 total bytes scrubbed: 0.00B with 0 errors scrub device /dev/sdd (id 3) history scrub started at Fri May 3 21:18:18 2019 and was aborted after 00:00:00 total bytes scrubbed: 0.00B with 0 errors
Aborted.
This was a no-go; we cannot do a scrub on a RAID-5 pool with only one original disk and a second one that hasn't been replaced correctly.
# ls -gG /pub/tmp/ total 37223496 -rwxrw-r-- 1 18576345 Apr 21 09:08 1.pdf -rwxrw-r-- 1 30255102 Apr 21 09:08 2.pdf -rwxrw-r-- 1 22016195 Apr 21 09:08 3.pdf -rwxrw-r-- 1 35456180485 Apr 21 07:58 bar.mkv -rwxrw-r-- 1 625338368 Mar 5 2018 boo.iso -rwxrw-r-- 1 1548841922 Apr 15 23:50 foo.mkv -rwxrw-r-- 1 415633408 Mar 5 2018 moo.iso
The only thing left is to see how many files I can salvage:
$ rsync -a --progress --stats mnt/testbox/pub/tmp/ tmp3/ sending incremental file list ./ 1.pdf 18,576,345 100% 111.22MB/s 0:00:00 (xfr#1, to-chk=6/8) 2.pdf 30,255,102 100% 68.05MB/s 0:00:00 (xfr#2, to-chk=5/8) 3.pdf 22,016,195 100% 33.97MB/s 0:00:00 (xfr#3, to-chk=4/8) bar.mkv 41,451,520 0% 39.18MB/s 0:14:42
Then it halted.
I then tried copying the files over one at a time, and to my big surprise I actually managed to get all the files except the "bar.mkv" file!
$ ls -gG total 2598328 -rwxr-xr-x 1 18576345 Apr 21 09:08 1.pdf -rwxr-xr-x 1 30255102 Apr 21 09:08 2.pdf -rwxr-xr-x 1 22016195 Apr 21 09:08 3.pdf -rwxr-xr-x 1 625338368 Mar 5 2018 boo.iso -rwxr-xr-x 1 1548841922 Apr 15 23:50 foo.mkv -rwxr-xr-x 1 415633408 Mar 5 2018 moo.iso
During the attempt to transfer the "bar.mkv" file the following errors showed up on the Btrfs machine:
# dmesg [ 4177.376785] BTRFS error (device sdc): bad tree block start, want 38944768 have 7071809559058736496 [ 4177.378494] BTRFS error (device sdc): bad tree block start, want 38961152 have 16350034114213725736 [ 4177.378718] BTRFS error (device sdc): bad tree block start, want 38977536 have 8392528330119265768 [ 4177.379183] BTRFS error (device sdc): bad tree block start, want 38928384 have 6084014255993522895 [ 4181.808743] BTRFS critical (device sdc): corrupt node: root=7 block=39124992 slot=0, unaligned pointer, have 12335186693368 should be aligned to 4096 [ 4181.808757] BTRFS info (device sdc): no csum found for inode 261 start 52690944 [ 4181.808856] BTRFS critical (device sdc): corrupt node: root=7 block=39124992 slot=0, unaligned pointer, have 12335186693368 should be aligned to 4096 [ 4181.808866] BTRFS info (device sdc): no csum found for inode 261 start 52695040 [ 4181.808955] BTRFS critical (device sdc): corrupt node: root=7 block=39124992 slot=0, unaligned pointer, have 12335186693368 should be aligned to 4096 [ 4181.808965] BTRFS info (device sdc): no csum found for inode 261 start 52699136 [ 4181.809051] BTRFS critical (device sdc): corrupt node: root=7 block=39124992 slot=0, unaligned pointer, have 12335186693368 should be aligned to 4096 ...
Btrfs has the "btrfs restore" command, which is used to try to salvage files from a damaged filesystem and restore them somewhere else. The man page explains:
btrfs restore could be used to retrieve file data, as far as the metadata are readable. The checks done by restore are less strict and the process is usually able to get far enough to retrieve data from the whole filesystem. This comes at a cost that some data might be incomplete or from older versions if they’re available.
There are several options to attempt restoration of various file metadata type. You can try a dry run first to see how well the process goes and use further options to extend the set of restored metadata.
I have 129G available on the boot disk so I can try to restore files to that drive.
I'm going to use "sdc" first, which is the healthy and original working drive, followed by "sdd", which is the disk that was partly replaced. The last disk, "sdb", is useless.
# mkdir /restored-files # umount /pub # btrfs restore -D /dev/sdc /restored-files/ warning, device 3 is missing checksum verify failed on 115867648 found E486C552 wanted 006578E4 bad tree block 115867648, bytenr mismatch, want=115867648, have=65536 Could not open root, trying backup super warning, device 3 is missing checksum verify failed on 38895616 found 69CF6F65 wanted 8D2CD2D3 checksum verify failed on 38895616 found 69CF6F65 wanted 8D2CD2D3 bad tree block 38895616, bytenr mismatch, want=38895616, have=65536 checksum verify failed on 115867648 found E486C552 wanted 006578E4 bad tree block 115867648, bytenr mismatch, want=115867648, have=65536 Could not open root, trying backup super warning, device 3 is missing checksum verify failed on 38895616 found 69CF6F65 wanted 8D2CD2D3 checksum verify failed on 38895616 found 69CF6F65 wanted 8D2CD2D3 bad tree block 38895616, bytenr mismatch, want=38895616, have=65536 checksum verify failed on 115867648 found E486C552 wanted 006578E4 bad tree block 115867648, bytenr mismatch, want=115867648, have=65536 Could not open root, trying backup super # btrfs restore -D /dev/sdd /restored-files/ warning, device 3 is missing checksum verify failed on 22020096 found 7DCD7CC1 wanted 28699DE8 checksum verify failed on 22020096 found 7DCD7CC1 wanted 28699DE8 bad tree block 22020096, bytenr mismatch, want=22020096, have=899525736547221204 ERROR: cannot read chunk root Could not open root, trying backup super warning, device 3 is missing warning, device 1 is missing bad tree block 22020096, bytenr mismatch, want=22020096, have=0 ERROR: cannot read chunk root Could not open root, trying backup super warning, device 3 is missing warning, device 1 is missing bad tree block 22020096, bytenr mismatch, want=22020096, have=0 ERROR: cannot read chunk root Could not open root, trying backup super
Removing the useless disk in order to run on two disks only doesn't work; Btrfs refuses to shrink a RAID-5 below its minimum number of devices:
# btrfs device remove missing 3 /pub ERROR: error removing device 'missing': unable to go below two devices on raid5 ERROR: error removing devid 3: unable to go below two devices on raid5
Adding a new disk and then trying to have Btrfs re-balance also fails, as expected:
# btrfs balance start -v /pub/ Dumping filters: flags 0x7, state 0x0, force is off DATA (flags 0x0): balancing METADATA (flags 0x0): balancing SYSTEM (flags 0x0): balancing WARNING: Full balance without filters requested. This operation is very intense and takes potentially very long. It is recommended to use the balance filters to narrow down the scope of balance. Use 'btrfs balance start --full-balance' option to skip this warning. The operation will start in 10 seconds. Use Ctrl-C to stop it. 10 9 8 7 6 5 4 3 2 1 Starting balance without any filters.
The balance ends prematurely:
# dmesg [ 1179.816473] BTRFS info (device sdc): balance: resume -dusage=90 -musage=90 -susage=90 [ 1179.816732] BTRFS info (device sdc): relocating block group 48524951552 flags data|raid5 [ 1180.074942] BTRFS info (device sdc): relocating block group 47451209728 flags metadata|raid5 [ 1180.391206] BTRFS info (device sdc): found 12 extents [ 1180.632952] BTRFS info (device sdc): relocating block group 47384100864 flags system|raid5 [ 1180.894086] BTRFS info (device sdc): found 1 extents [ 1181.132700] BTRFS info (device sdc): relocating block group 42988470272 flags data|raid5 [ 1211.850068] BTRFS info (device sdc): found 13 extents [ 1213.063935] BTRFS error (device sdc): bad tree block start, want 65650688 have 13914138350834705721 [ 1213.072832] BTRFS: error (device sdc) in btrfs_run_delayed_refs:3011: errno=-5 IO failure [ 1213.072834] BTRFS info (device sdc): forced readonly [ 1213.072859] BTRFS info (device sdc): balance: ended with status: -30
I was actually very surprised at the number of files that I managed to salvage with Btrfs.
This means that either all the files, except the missing one, were located physically on that single healthy drive, or parts of the files plus the needed parity data were all located on that single healthy drive plus the second drive that was partially replaced.
Does this mean that Btrfs perhaps isn't very good at balancing data and parity data evenly across multiple drives in a RAID-5 setup so that I ended up having most of the data needed on only one drive?
Or does this mean that with Btrfs sometimes you just "get lucky" and stand a greater chance at getting your files back even when two drives fail in a RAID-5 setup?
I decided to re-test this to see if I would get the same results again, this time by pulling the "sdc" disk, which was healthy before. Of course, I might just get the same results because Btrfs is now using another disk in the same way.
I have created a completely fresh RAID-5 pool and mounted it:
# mkfs.btrfs -f -m raid5 -d raid5 /dev/disk/by-id/ata-ST31000340NS_9QJ089LF /dev/disk/by-id/ata-ST31000340NS_9QJ0DVN2 /dev/disk/by-id/ata-ST31000340NS_9QJ0EZZC btrfs-progs v4.20.2 See http://btrfs.wiki.kernel.org for more information. Label: (null) UUID: 226b366f-64f0-447e-87eb-31c91e5992b6 Node size: 16384 Sector size: 4096 Filesystem size: 2.73TiB Block group profiles: Data: RAID5 2.00GiB Metadata: RAID5 2.00GiB System: RAID5 16.00MiB SSD detected: no Incompat features: extref, raid56, skinny-metadata Number of devices: 3 Devices: ID SIZE PATH 1 931.51GiB /dev/disk/by-id/ata-ST31000340NS_9QJ089LF 2 931.51GiB /dev/disk/by-id/ata-ST31000340NS_9QJ0DVN2 3 931.51GiB /dev/disk/by-id/ata-ST31000340NS_9QJ0EZZC # mount -o noatime,compress=lzo /dev/disk/by-id/ata-ST31000340NS_9QJ089LF /pub
Then I have transferred all the files from the client again:
# ls -gG /pub/tmp/ total 37223496 -rwxrw-r-- 1 18576345 Apr 21 09:08 1.pdf -rwxrw-r-- 1 30255102 Apr 21 09:08 2.pdf -rwxrw-r-- 1 22016195 Apr 21 09:08 3.pdf -rwxrw-r-- 1 35456180485 Apr 21 07:58 bar.mkv -rwxrw-r-- 1 625338368 Mar 5 2018 boo.iso -rwxrw-r-- 1 1548841922 Apr 15 23:50 foo.mkv -rwxrw-r-- 1 415633408 Mar 5 2018 moo.iso # btrfs filesystem df /pub/ Data, RAID5: total=36.00GiB, used=35.51GiB System, RAID5: total=16.00MiB, used=16.00KiB Metadata, RAID5: total=2.00GiB, used=40.39MiB GlobalReserve, single: total=40.20MiB, used=0.00B
I have now removed the device that was "sdc".
# btrfs filesystem show -d warning, device 1 is missing checksum verify failed on 85508096 found A0A8052D wanted 444BB89B bad tree block 85508096, bytenr mismatch, want=85508096, have=65536 Couldn't read tree root Label: none uuid: 663b05c8-c9b3-4c88-a450-36b5e25a39c2 Total devices 3 FS bytes used 35.55GiB devid 2 size 931.51GiB used 19.01GiB path /dev/sdb devid 3 size 931.51GiB used 19.01GiB path /dev/sdd *** Some devices missing
I am then mounting the Btrfs pool in a degraded state and beginning a replacement, then I will remove the next drive from the pool during the replacement:
# mount -o noatime,compress=lzo,degraded /dev/disk/by-id/ata-ST31000340NS_9QJ0DVN2 /pub/ # btrfs replace start -f 1 /dev/disk/by-id/ata-ST31000340NS_9QJ0ES1V /pub/ # btrfs replace status -1 /pub 0.2% done, 0 write errs, 0 uncorr. read errs
This time I am experiencing a crash:
# dmesg [ 581.184298] kernel BUG at fs/btrfs/raid56.c:1910! [ 581.184304] invalid opcode: 0000 [#3] PREEMPT SMP PTI [ 581.184309] CPU: 1 PID: 366 Comm: kworker/u8:0 Tainted: G D I 5.0.10-arch1-1-ARCH #1 [ 581.184315] Hardware name: Hewlett-Packard HP Compaq dc7900 Small Form Factor/3031h, BIOS 786G1 v01.08 08/25/2008 [ 581.184351] Workqueue: btrfs-endio-raid56 btrfs_endio_raid56_helper [btrfs] [ 581.184385] RIP: 0010:__raid_recover_end_io+0x37e/0x450 [btrfs] [ 581.184390] Code: 00 ff ff ff ff 85 c0 74 47 83 f8 02 0f 85 e3 00 00 00 48 83 c4 10 48 89 df 31 f6 5b 5d 41 5c 41 5d 41 5e 41 5f e9 f2 ee ff ff <0f> 0b 4c 8d a3 98 00 00 00 4c 89 e7 e8 51 73 f1 d7 f0 80 8b b0 00 [ 581.184399] RSP: 0018:ffff9eb141347e18 EFLAGS: 00010213 [ 581.184403] RAX: ffff92d37c72a800 RBX: ffff92d37f03d800 RCX: 0000000000000000 [ 581.184408] RDX: 0000000000000002 RSI: 0000000000000010 RDI: 0000000000000003 [ 581.184412] RBP: 0000000000000000 R08: 0000000000000008 R09: ffff92d391a0a000 [ 581.184417] R10: 0000000000000008 R11: 000000000000000c R12: 0000000000000003 [ 581.184426] R13: 0000000000000000 R14: 0000000000000001 R15: ffff92d384525e80 [ 581.184435] FS: 0000000000000000(0000) GS:ffff92d393a80000(0000) knlGS:0000000000000000 [ 581.184440] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 581.184445] CR2: 000055fdae19b1c8 CR3: 000000020f3ca000 CR4: 00000000000406e0 [ 581.184449] Call Trace: [ 581.184484] normal_work_helper+0xbd/0x350 [btrfs] [ 581.184491] process_one_work+0x1eb/0x410 [ 581.184496] worker_thread+0x2d/0x3d0 [ 581.184501] ? process_one_work+0x410/0x410 [ 581.184506] kthread+0x112/0x130 [ 581.184511] ? 
kthread_park+0x80/0x80 [ 581.184516] ret_from_fork+0x35/0x40 [ 581.184521] Modules linked in: snd_hda_codec_analog i915 snd_hda_codec_generic ledtrig_audio kvmgt vfio_mdev mdev btrfs vfio_iommu_type1 vfio i2c_algo_bit snd_hda_intel drm_kms_helper snd_hda_codec coretemp drm snd_hda_core libcrc32c syscopyarea kvm snd_hwdep sysfillrect snd_pcm sysimgblt xor fb_sys_fops irqbypass snd_timer input_leds snd raid6_pq joydev tpm_infineon psmouse tpm_tis soundcore hp_wmi tpm_tis_core intel_agp sparse_keymap mei_wdt iTCO_wdt e1000e mei_me tpm intel_gtt iTCO_vendor_support rfkill pcspkr mei gpio_ich agpgart wmi_bmof evdev rng_core mac_hid lpc_ich wmi pcc_cpufreq acpi_cpufreq ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 fscrypto hid_generic usbhid hid sd_mod serio_raw uhci_hcd atkbd libps2 ahci libahci ata_generic pata_acpi libata ehci_pci ehci_hcd scsi_mod floppy i8042 serio
The replacement has also stalled, so I rebooted the Btrfs machine.
Now, I cannot mount the filesystem:
# mount -o noatime,compress=lzo,degraded /dev/disk/by-id/ata-ST31000340NS_9QJ0DVN2 /pub/ mount: /pub: wrong fs type, bad option, bad superblock on /dev/sdb, missing codepage or helper program, or other error.
I have tried btrfs device scan, mounting in "recovery" mode, btrfs restore, and btrfs rescue zero-log, but nothing worked.
Have I just now hit one of the RAID-5 bugs? The wiki does say:
The parity RAID code has multiple serious data-loss bugs in it. It should not be used for anything other than testing purposes.
The answer is actually no. Well, the crash is a bug, but the mount issue is not a bug.
The simple fact is that you cannot expect to survive a two-drive failure in a RAID-5 setup no matter what filesystem you are using.
Sometimes, as in the first attempt, you might get away with restoring some files. At other times you will simply lose the entire pool. Expect the latter with both ZFS and Btrfs!
Enough Btrfs for now. Time to test mdadm+dm-integrity.
mdadm+dm-integrity RAID-5
UPDATE 2019-08-27: It has come to my attention (thank you Philip!) that I made an unfortunate mistake in my tests of mdadm+dm-integrity. When I tested for data integrity errors I wrote to /dev/mapper/sdb, which also updates the dm-integrity checksum. Later, when I do the sync-action check, the errors are not dm-integrity checksum errors, but rather RAID parity errors. The correct test would have been to write random data directly to /dev/sdb. At the end of the mdadm+dm-integrity section I have copy/pasted the result of a test Philip sent me by email, which contains an example of how the test should have been run. I have also updated the article with a note each time I made the mistake.
I stumbled upon dm-integrity as I was doing some of the tests with Btrfs and I haven't used it before. I therefore thought that it would be interesting to see how mdadm+dm-integrity handles the same problems that I have just tested ZFS and Btrfs with.
mdadm is used for administering pure software RAID on plain block devices, but it does not provide any kind of data integrity verification. If a read error is encountered, mdadm recalculates the block in error and writes it back. In a mirror there is nothing to recalculate from, so mdadm takes the data from the first available drive, assumes it is correct, and writes it to the other drive. In a degraded RAID pool mdadm terminates immediately without doing anything, as it cannot recalculate the faulty data.
dm-integrity by itself has nothing to do with RAID. When dm-integrity encounters a data integrity error it simply returns EILSEQ (instead of EIO); it is then up to the RAID driver, mdadm in this case, to handle the integrity error properly. dm-integrity can also be used without encryption when encryption is not desired for technical or other reasons.
From the documentation:
The dm-integrity target can also be used as a standalone target, in this mode it calculates and verifies the integrity tag internally. In this mode, the dm-integrity target can be used to detect silent data corruption on the disk or in the I/O path.
To guarantee write atomicity, the dm-integrity target uses journal, it writes sector data and integrity tags into a journal, commits the journal and then copies the data and integrity tags to their respective location.
If you combine dm-integrity with an mdadm RAID (RAID-1/mirror, RAID-5, or any other redundant setup) you get both disk redundancy and error detection and correction: dm-integrity raises checksum errors when it encounters invalid data, which mdadm notices and then repairs with correct data.
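The repair pass itself is triggered through the md sysfs interface rather than a filesystem command. A minimal sketch, assuming the array comes up as md127 as it does below; the helper names and the overridable md variable are my own:

```shell
# A check ("scrub") of the whole array is requested through sysfs.
# Path assumes the array is md127; md can be overridden for testing.
md=${md:-/sys/block/md127/md}

# Ask md to read every stripe. A sector failing its dm-integrity tag
# comes back as a read error, which md rebuilds from the remaining
# devices and writes back.
start_check() { echo check > "$md/sync_action"; }

# True (exit 0) while a check is still in progress.
check_running() { grep -q check "$md/sync_action"; }

# Mismatches counted by the last check. Note: as the update above
# explains, this counts RAID parity errors, not dm-integrity errors.
mismatches() { cat "$md/mismatch_cnt"; }
```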
With mdadm, you specify the raid device to create, the raid mode level (raid0, raid1, raid10, raid5, raid6 etc) and the devices. mdadm is very well documented and it contains tons of options with examples as well, but it is also very easy to make mistakes with mdadm.
If you just want simple data integrity verification without any of the extra functionality that ZFS or Btrfs offers, then dm-integrity alone can do the job - you just need to run regular scrubs of the filesystem and then make sure you have adequate backup to handle any potential integrity problems.
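In that standalone setup a "scrub" is nothing more than reading the device end to end: dm-integrity fails the read of any sector whose stored tag no longer matches. A minimal sketch, assuming the device was opened as /dev/mapper/sdb as below; the function name is my own:

```shell
# Read an opened dm-integrity device front to back. Any sector with a
# bad tag makes the read fail (EILSEQ), so dd's exit status tells us
# whether the device is clean.
integrity_scan() {
    if dd if="$1" of=/dev/null bs=1M status=none; then
        echo "integrity scan clean"
    else
        echo "integrity errors on $1 - restore affected files from backup"
    fi
}

# Example (commented out, needs the device to exist):
# integrity_scan /dev/mapper/sdb
```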
Since I'm not using encryption in these tests I will use the integritysetup command instead of the cryptsetup command to format the disks. It is worth noting, however, that dm-integrity is best integrated with dm-crypt+LUKS for disk encryption.
By default, integritysetup uses "crc32", which is relatively fast and requires just 4 bytes per block. This gives a probability of about 1 in 2^32 that a random corruption goes undetected. This is then on top of any silent corruption on the hard drive itself.
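To put the tag sizes in perspective, the space overhead per sector is tiny either way; a quick sketch of the arithmetic for the default crc32 (4 bytes) and the sha256 tag (32 bytes) I use below:

```shell
# Space overhead of the integrity tag per 4096-byte data sector,
# in percent, for crc32 (4 bytes) and sha256 (32 bytes).
awk 'BEGIN {
    sector = 4096
    printf "crc32:  %.2f%%\n", 100 * 4  / sector
    printf "sha256: %.2f%%\n", 100 * 32 / sector
}'
```

Even sha256 costs well under one percent of the raw capacity; the real cost of dm-integrity is the extra journalled writes, not the space.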
With dm-integrity the devices need to be wiped during formatting in order to avoid invalid checksums. As this takes an extremely long time with the 1 TB disks, I have swapped in three old 160 GB disks, and I'm just going to use the shorthand sdX for device names (never do that; I am only doing it for the sake of the test. Always use device names with serial numbers for easy identification).
# integritysetup format --integrity sha256 /dev/sdb WARNING! ======== This will overwrite data on /dev/sdb irrevocably. Are you sure? (Type uppercase yes): YES WARNING: Device /dev/sdb already contains a 'dos' partition signature. Formatted with tag size 4, internal integrity sha256. Wiping device to initialize integrity checksum. You can interrupt this by pressing CTRL+c (rest of not wiped device will contain invalid checksum). Progress: 2.0%, ETA 49:12, 2991 MiB written, speed 50.3 MiB/s
Then opening the devices:
# integritysetup open --integrity sha256 /dev/sdb sdb # integritysetup open --integrity sha256 /dev/sdc sdc # integritysetup open --integrity sha256 /dev/sdd sdd
And creating the mdadm RAID-5 system:
# mdadm --create --verbose --assume-clean --level=5 --raid-devices=3 /dev/md/raid5 /dev/mapper/sdb /dev/mapper/sdc /dev/mapper/sdd mdadm: layout defaults to left-symmetric mdadm: layout defaults to left-symmetric mdadm: chunk size defaults to 512K mdadm: size set to 154882048K mdadm: automatically enabling write-intent bitmap on large pool mdadm: Defaulting to version 1.2 metadata mdadm: pool /dev/md/raid5 started.
Then create the ext4 filesystem on top of that:
# mkfs.ext4 /dev/md/raid5
In the above I have just used the defaults; I didn't calculate the correct stripe width and stride for a mdadm RAID-5 setup.
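For completeness, the values I skipped can be derived from the mdadm geometry: stride is the chunk size divided by the ext4 block size, and stripe-width is stride times the number of data-bearing disks. A sketch of the arithmetic, with the numbers from the mdadm output below (512 KiB chunks, 4 KiB ext4 blocks, 3 disks of which 2 carry data):

```shell
chunk_kib=512              # mdadm chunk size in KiB (from mdadm --detail)
block_kib=4                # ext4 block size in KiB
data_disks=2               # RAID-5 loses one of the three disks to parity

stride=$((chunk_kib / block_kib))       # filesystem blocks per chunk
stripe_width=$((stride * data_disks))   # blocks per full data stripe
echo "stride=$stride stripe-width=$stripe_width"
```

These would then be passed as mkfs.ext4 -E stride=128,stripe-width=256 /dev/md/raid5.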
Time to get some status information:
$ cat /proc/mdstat Personalities : [raid6] [raid5] [raid4] md127 : active raid5 dm-2[2] dm-1[1] dm-0[0] 309764096 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU] bitmap: 0/2 pages [0KB], 65536KB chunk unused devices: <none>
# mdadm --misc -D /dev/md/raid5
/dev/md/raid5:
           Version : 1.2
     Creation Time : Sat Apr 27 04:22:29 2019
        Raid Level : raid5
        Array Size : 309764096 (295.41 GiB 317.20 GB)
     Used Dev Size : 154882048 (147.71 GiB 158.60 GB)
      Raid Devices : 3
     Total Devices : 3
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Sat Apr 27 04:24:23 2019
             State : clean
    Active Devices : 3
   Working Devices : 3
    Failed Devices : 0
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : testbox:raid5  (local to host testbox)
              UUID : 43262b9e:9d039a22:539d7bbf:be558b32
            Events : 2

    Number   Major   Minor   RaidDevice State
       0     254        0        0      active sync   /dev/dm-0
       1     254        1        1      active sync   /dev/dm-1
       2     254        2        2      active sync   /dev/dm-2
Then just to validate dm-integrity is setup correctly (here just on the sdb disk):
# ls /sys/block/md127/integrity/
device_is_integrity_capable

# ls /sys/block/sdb/integrity/
device_is_integrity_capable

# ls /sys/block/dm-0/integrity/
device_is_integrity_capable
# dmsetup info /dev/mapper/sdb
Name:              sdb
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      254, 0
Number of targets: 1
UUID: CRYPT-INTEGRITY-sdb
...
So I am up and running with a mdadm RAID-5 setup with dm-integrity.
I have mounted the mdadm device on /pub and it's time to transfer some files from the client using rsync:
$ rsync -a --progress --stats tmp/ mnt/testbox/pub/tmp/
1.pdf
     18,576,345 100%  245.62MB/s    0:00:00 (xfr#1, to-chk=6/8)
2.pdf
     30,255,102 100%   72.86MB/s    0:00:00 (xfr#2, to-chk=5/8)
3.pdf
     22,016,195 100%   27.41MB/s    0:00:00 (xfr#3, to-chk=4/8)
bar.mkv
 35,456,180,485 100%   30.32MB/s    0:18:35 (xfr#4, to-chk=3/8)
boo.iso
         32,768   0%  744.19kB/s    0:14:00
    625,338,368 100%    5.70MB/s    0:01:44 (xfr#5, to-chk=2/8)
foo.mkv
  1,548,841,922 100%   56.08MB/s    0:00:26 (xfr#6, to-chk=1/8)
moo.iso
    415,633,408 100%    7.12MB/s    0:00:55 (xfr#7, to-chk=0/8)

Number of files: 8 (reg: 7, dir: 1)
Number of created files: 8 (reg: 7, dir: 1)
Number of deleted files: 0
Number of regular files transferred: 7
Total file size: 38,116,841,825 bytes
Total transferred file size: 38,116,841,825 bytes
Literal data: 38,116,841,825 bytes
Matched data: 0 bytes
File list size: 0
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 38,126,148,151
Total bytes received: 202

sent 38,126,148,151 bytes  received 202 bytes  28,938,253.02 bytes/sec
total size is 38,116,841,825  speedup is 1.00
This was painfully slow, but these are some very old 2.5" 5400 RPM laptop disks I'm using.
Compared to the ZFS RAID-Z transfer:
sent 38,126,148,150 bytes received 202 bytes 106,945,717.68 bytes/sec
And the Btrfs RAID-5 transfer:
sent 38,126,148,151 bytes received 202 bytes 102,078,041.11 bytes/sec
During the transfer top showed dm-integrity working, with "md127_raid5" occasionally spiking to 100% CPU usage:
%CPU %MEM     TIME+ COMMAND
31.2  0.0   1:42.90 md127_raid5
 9.0  0.3   0:19.28 smbd
 2.0  0.0   0:00.19 kworker/u8:1+dm-integrity-wait
 1.3  0.0   0:00.89 kworker/1:22+dm-integrity-writer
 1.0  0.0   0:00.58 kworker/1:28+dm-integrity-writer
 0.7  0.0   0:04.09 kworker/u8:0+flush-9:127
 0.7  0.0   0:02.51 kworker/u8:2-dm-integrity-wait
 0.7  0.0   0:00.59 kworker/1:17-dm-integrity-metadata
 0.3  0.0   0:05.31 kworker/0:1H-kblockd
 0.3  0.0   0:00.19 kworker/1:30-dm-integrity-metadata
 0.3  0.0   0:00.20 kworker/1:31+dm-integrity-commit
mdadm - Power outage
I have again added the file zoo.mkv to the files on the client and will begin the rsync transfer, then pull the power cord at about 50%.
$ rsync -a --progress --stats tmp/ mnt/testbox/pub/tmp/
sending incremental file list
zoo.mkv
  5,887,590,400  38%   91.19kB/s   20:22:4  ^C
The power cord has been pulled. The transfer aborted on the client and the mdadm machine has been powered back up. Because I'm using dm-integrity I have to remember to open up the devices:
# integritysetup open --integrity sha256 /dev/sdb sdb
# integritysetup open --integrity sha256 /dev/sdc sdc
# integritysetup open --integrity sha256 /dev/sdd sdd
Now I can take a look at the state of the system:
$ dmesg
[   82.332815] md/raid:md127: not clean -- starting background reconstruction
[   82.332838] md/raid:md127: device dm-2 operational as raid disk 2
[   82.332839] md/raid:md127: device dm-1 operational as raid disk 1
[   82.332840] md/raid:md127: device dm-0 operational as raid disk 0
[   82.333329] md/raid:md127: raid level 5 active with 3 out of 3 devices, algorithm 2
It's working. The filesystem was shut down in an unclean fashion and top shows a kworker process busy cleaning up:
$ top -n 1
%CPU %MEM     TIME+ COMMAND
 6.2  0.0   0:00.58 kworker/1:1H-kblockd
 0.0  0.1   0:00.53 systemd
 0.0  0.0   0:00.00 kthreadd
 0.0  0.0   0:00.00 rcu_gp
 0.0  0.0   0:00.00 rcu_par_gp
 0.0  0.0   0:00.08 kworker/0:0-dm-integrity-metadata
 0.0  0.0   0:00.00 kworker/0:0H-kblockd
mdstat however doesn't show anything useful:
$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 dm-2[2] dm-1[1] dm-0[0]
      309764096 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
      bitmap: 0/2 pages [0KB], 65536KB chunk

unused devices: <none>
Rather, with mdadm we need to pay close attention to dmesg or the log:
# dmesg
[  190.074041] EXT4-fs (md127): recovery completed
So now I can mount the RAID-5 and take a look at the directory:
# mount /dev/md/raid5 /pub
$ ls -gG /pub
-rwxrw-r-- 1    18576345 Apr 21 09:08 1.pdf
-rwxrw-r-- 1    30255102 Apr 21 09:08 2.pdf
-rwxrw-r-- 1    22016195 Apr 21 09:08 3.pdf
-rwxrw-r-- 1 35456180485 Apr 21 07:58 bar.mkv
-rwxrw-r-- 1   625338368 Mar  5  2018 boo.iso
-rwxrw-r-- 1  1548841922 Apr 15 23:50 foo.mkv
-rwxrw-r-- 1   415633408 Mar  5  2018 moo.iso
-rwxrw-r-- 1           0 Apr 27 19:51 .zoo.mkv.pH62pr
Traces of the broken rsync transfer remain in the directory.
mdadm - Drive failure
Time to simulate a drive failure during the transfer of the zoo.mkv file. I will again remove a drive from the pool and see how dm-integrity+mdadm handles that.
$ dmesg
[  865.320920] ata4: SATA link down (SStatus 0 SControl 300)
[  865.320926] ata4.00: disabled
[  865.320942] sd 3:0:0:0: rejecting I/O to offline device
[  865.320945] print_req_error: I/O error, dev sdb, sector 712 flags 801
[  865.320951] print_req_error: I/O error, dev sdb, sector 712 flags 801
[  865.320956] device-mapper: integrity: Error on writing journal: -5
[  865.320966] sd 3:0:0:0: rejecting I/O to offline device
[  865.320968] print_req_error: I/O error, dev sdb, sector 131455112 flags 0
[  865.320979] sd 3:0:0:0: rejecting I/O to offline device
[  865.320981] print_req_error: I/O error, dev sdb, sector 131455120 flags 0
[  865.320982] md: super_written gets error=10
[  865.320986] md/raid:md127: Disk failure on dm-0, disabling device.
               md/raid:md127: Operation continuing on 2 devices.
[  865.321003] md/raid:md127: read error not correctable (sector 130306048 on dm-0).
[  865.321012] md/raid:md127: read error not correctable (sector 130306056 on dm-0).
[  865.321019] md/raid:md127: read error not correctable (sector 130306064 on dm-0).
[  865.321026] md/raid:md127: read error not correctable (sector 130306072 on dm-0).
[  865.321033] md/raid:md127: read error not correctable (sector 130306080 on dm-0).
[  865.321040] md/raid:md127: read error not correctable (sector 130306088 on dm-0).
[  865.321047] md/raid:md127: read error not correctable (sector 130306096 on dm-0).
[  865.321054] md/raid:md127: read error not correctable (sector 130306104 on dm-0).
[  865.321061] md/raid:md127: read error not correctable (sector 130306112 on dm-0).
[  865.321063] ata4.00: detaching (SCSI 3:0:0:0)
[  865.321069] md/raid:md127: read error not correctable (sector 130306120 on dm-0).
[  865.323206] sd 3:0:0:0: [sdb] Synchronizing SCSI cache
[  865.323245] sd 3:0:0:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  865.323247] sd 3:0:0:0: [sdb] Stopping disk
[  865.323258] sd 3:0:0:0: [sdb] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
dmesg shows unrecoverable read errors.
Let's look at mdadm:
# mdadm --misc -D /dev/md/raid5
/dev/md/raid5:
           Version : 1.2
     Creation Time : Sat Apr 27 19:15:28 2019
        Raid Level : raid5
        Array Size : 309764096 (295.41 GiB 317.20 GB)
     Used Dev Size : 154882048 (147.71 GiB 158.60 GB)
      Raid Devices : 3
     Total Devices : 3
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Sat Apr 27 20:07:24 2019
             State : active, degraded
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 1
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : testbox:raid5  (local to host testbox)
              UUID : c6183471:4e732124:11d110d8:1acd5593
            Events : 135

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1     254        1        1      active sync   /dev/dm-1
       2     254        2        2      active sync   /dev/dm-2

       0     254        0        -      faulty   /dev/dm-0
mdstat also shows the drive as missing (there should be three U's):
$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 dm-2[2] dm-1[1] dm-0[0](F)
      309764096 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [_UU]
      bitmap: 1/2 pages [4KB], 65536KB chunk

unused devices: <none>
Now I need to figure out which device dm-0 is, just so I can keep track (this is why you should always use device names with serial numbers, so you can keep physical track of the drives too):
$ ls -l /dev/disk/by-id/
dm-name-sdb -> ../../dm-0
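This mapping step comes up every time a member fails, so it can be wrapped in a small helper. The sketch below just resolves every symlink in a directory; pointed at /dev/disk/by-id it shows which dm-N device each dm-name-* entry maps to (the function name is mine, not a standard tool):

```shell
# Sketch: print every symlink in a directory together with its resolved
# target. On a real system call it as: list_links /dev/disk/by-id
list_links() {
    for link in "$1"/*; do
        if [ -L "$link" ]; then
            printf '%s -> %s\n' "${link##*/}" "$(readlink -f "$link")"
        fi
    done
}

if [ -d /dev/disk/by-id ]; then
    list_links /dev/disk/by-id
fi
```

Combined with the serial-number links (ata-...) in the same directory, this ties each dm device back to a physical disk.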
I haven't got any room for a spare device so I'll mark the device as failed and remove it:
# mdadm --manage /dev/md/raid5 --fail /dev/dm-0
mdadm: set /dev/dm-0 faulty in /dev/md/raid5

# mdadm --manage /dev/md/raid5 --remove /dev/dm-0
mdadm: hot removed /dev/dm-0 from /dev/md/raid5
# mdadm --misc -D /dev/md/raid5
...
    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1     254        1        1      active sync   /dev/dm-1
       2     254        2        2      active sync   /dev/dm-2
Then I'll shut down the machine, attach a new disk, format it, and bring it into the pool as a replacement.
Looking at "by-id" I can verify that the new drive I have attached hasn't messed up the device mapping.
# ls -l /dev/disk/by-id/
ata-ST9160821AS_5MA7BFKV -> ../../sdb
ata-ST9160821AS_5MA7BFKV-part1 -> ../../sdb1
The new drive already contains a filesystem so I need to format and wipe it:
# integritysetup format --integrity sha256 /dev/sdb

WARNING!
========
This will overwrite data on /dev/sdb irrevocably.

Are you sure? (Type uppercase yes): YES
WARNING: Device /dev/sdb already contains a 'dos' partition signature.
Formatted with tag size 4, internal integrity sha256.
Wiping device to initialize integrity checksum.
You can interrupt this by pressing CTRL+c (rest of not wiped device will contain invalid checksum).
Progress:   0.3%, ETA 65:30, 509 MiB written, speed  38.4 MiB/s
Then once it's done I open the other drives:
# integritysetup open --integrity sha256 /dev/sdc sdc
# integritysetup open --integrity sha256 /dev/sdd sdd
With only the two working drives opened, mdadm shows that dm-0 is missing:
# mdadm --misc -D /dev/md/raid5
...
    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1     254        1        1      active sync   /dev/dm-1
       2     254        2        2      active sync   /dev/dm-2
Then I open sdb and attach it to the RAID:
# integritysetup open --integrity sha256 /dev/sdb sdb
# mdadm --add /dev/md/raid5 /dev/mapper/sdb
mdadm: added /dev/mapper/sdb
Now mdadm brings the RAID-5 pool into sync:
$ cat /proc/mdstat
md127 : active raid5 dm-0[3] dm-2[2] dm-1[1]
      309764096 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [_UU]
      [>....................]  recovery =  0.3% (498968/154882048) finish=165.0min speed=15592K/sec
      bitmap: 1/2 pages [4KB], 65536KB chunk

unused devices: <none>
ZFS implements a very sophisticated block tracking mechanism, so it knows exactly which blocks it needs to reconstruct. This means that ZFS only reconstructs the used blocks and is extremely fast at it, especially if the disks don't contain much data. Btrfs is slower than ZFS, but it also only reconstructs the used blocks.
mdadm, on the other hand, reconstructs every single block on the disk, including the unused ones, which makes the reconstruction process extremely slow even on small disks.
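The rebuild rate is also capped by two kernel sysctls, in KiB/s per device. Raising them can shorten a resync at the cost of competing I/O; the values below are examples, not recommendations:

```shell
# Sketch: inspect and raise the md rebuild speed bounds (KiB/s per device).
cat /proc/sys/dev/raid/speed_limit_min    # default: 1000
cat /proc/sys/dev/raid/speed_limit_max    # default: 200000

echo 50000 > /proc/sys/dev/raid/speed_limit_min
```

Even with the limits raised, mdadm still has to touch every block, so the reconstruction remains fundamentally slower than ZFS or Btrfs on a mostly empty pool.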
Eventually mdadm is done:
$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 dm-0[3] dm-2[2] dm-1[1]
      309764096 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
      bitmap: 0/2 pages [0KB], 65536KB chunk

unused devices: <none>
And dmesg confirms this:
[17741.968738] md: md127: recovery done.
This took a very long time.
mdadm - Drive failure during file transfer
Now that the RAID-5 pool has been restored I'm going to remove a drive during a file transfer and then put it back in, and see how mdadm+dm-integrity behaves.
During the transfer the log shows that the drive is gone:
kernel: print_req_error: I/O error, dev sdd, sector 203305096 flags 0
kernel: device-mapper: integrity: Error on reading tags: -5
kernel: sd 6:0:0:0: rejecting I/O to offline device
kernel: print_req_error: I/O error, dev sdd, sector 46430728 flags 0
kernel: sd 6:0:0:0: [sdd] tag#2 CDB: Read(10) 28 00 02 c4 7a 08 00 00 80 00
kernel: sd 6:0:0:0: [sdd] tag#2 Add. Sense: Unaligned write command
kernel: sd 6:0:0:0: [sdd] tag#2 Sense Key : Illegal Request [current]
kernel: sd 6:0:0:0: [sdd] tag#2 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
kernel: ata7.00: disabled
kernel: ata7: SATA link down (SStatus 0 SControl 300)
mdadm reacted by halting the file transfer for about a second, then resumed it without the client noticing anything other than a momentary drop in the transfer speed.
$ rsync -a --progress --stats tmp/ mnt/testbox/pub/tmp/
sending incremental file list
zoo.mkv
 10,867,033,488 100%   35.52MB/s    0:04:51 (xfr#3, to-chk=0/9)

Number of files: 9 (reg: 8, dir: 1)
Number of created files: 1 (reg: 1)
Number of deleted files: 0
Number of regular files transferred: 3
Total file size: 48,983,875,313 bytes
Total transferred file size: 10,919,304,785 bytes
Literal data: 10,919,304,785 bytes
Matched data: 0 bytes
File list size: 0
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 10,921,970,962
Total bytes received: 76

sent 10,921,970,962 bytes  received 76 bytes  27,969,196.00 bytes/sec
total size is 48,983,875,313  speedup is 4.48
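Because the failure was completely invisible to the client, a degraded array is easy to miss. mdadm's monitor mode can send mail on array events such as a failed device (it will not catch silent corruption, which still needs scrubs); the address below is an example:

```shell
# Sketch: run mdadm in monitor mode so device failures generate mail
# instead of sitting silently in /proc/mdstat. Many distributions ship
# this as an "mdmonitor" service driven by MAILADDR in /etc/mdadm.conf.
mdadm --monitor --scan --daemonise --mail=root@localhost
```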
In the ZFS and Btrfs tests I always shut down the machine before re-attaching the removed drive, but this time I have just re-attached the drive. mdadm shows the drive as faulty:
$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 dm-0[3] dm-2[2](F) dm-1[1]
      309764096 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]
      bitmap: 1/2 pages [4KB], 65536KB chunk

unused devices: <none>
# mdadm --misc -D /dev/md/raid5
...
    Number   Major   Minor   RaidDevice State
       3     254        0        0      active sync   /dev/dm-0
       1     254        1        1      active sync   /dev/dm-1
       -       0        0        2      removed

       2     254        2        -      faulty   /dev/dm-2
So I'll have to remove the device from mdadm, close the device, then open it again and re-attach it:
# mdadm --manage /dev/md/raid5 --remove /dev/dm-2
mdadm: hot removed /dev/dm-2 from /dev/md/raid5

# mdadm --misc -D /dev/md/raid5
...
    Number   Major   Minor   RaidDevice State
       3     254        0        0      active sync   /dev/dm-0
       1     254        1        1      active sync   /dev/dm-1
       -       0        0        2      removed

$ ls -l /dev/disk/by-id/
dm-name-sdd -> ../../dm-2

# integritysetup close sdd
# integritysetup open --integrity sha256 /dev/sdd sdd
# mdadm --manage /dev/md/raid5 --re-add /dev/dm-2
mdadm: re-added /dev/dm-2
mdadm automatically begins the recovery process:
$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 dm-2[2] dm-0[3] dm-1[1]
      309764096 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]
      [==>..................]  recovery = 14.9% (23132796/154882048) finish=184.9min speed=11872K/sec
      bitmap: 1/2 pages [4KB], 65536KB chunk

unused devices: <none>
# mdadm --misc -D /dev/md/raid5
...
    Number   Major   Minor   RaidDevice State
       3     254        0        0      active sync   /dev/dm-0
       1     254        1        1      active sync   /dev/dm-1
       2     254        2        2      spare rebuilding   /dev/dm-2
As with ZFS and Btrfs any mdadm pool should be "scrubbed" at regular intervals.
On mdadm this basically involves reading the entire pool, such that any problems with the drive will trigger a read error and auto-correction, and any problems with the data will be picked up. It's controlled by writing to the "sync_action" parameter in /sys:
# echo check > /sys/block/md127/md/sync_action
# dmesg
[20520.024219] md: data-check of RAID pool md127
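For regular scrubbing this can be wrapped in a small cron-friendly script. The sketch below reads the md sysfs files directly; the function names are mine, and taking the sysfs directory as a parameter (on a real system, /sys/block/md127/md) keeps it easy to dry-run:

```shell
# Sketch: report the state of an md scrub from sysfs.
md_scrub_status() {
    printf 'sync_action=%s mismatch_cnt=%s\n' \
        "$(cat "$1/sync_action")" "$(cat "$1/mismatch_cnt")"
}

# Start a check and poll until the array goes idle again, then report
# the kernel's mismatch counter.
md_scrub() {
    echo check > "$1/sync_action"
    while [ "$(cat "$1/sync_action")" != "idle" ]; do
        sleep 60
    done
    md_scrub_status "$1"
}
```

A non-zero mismatch_cnt after a check is the signal that the parity and data disagreed somewhere.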
Notice the difference between the earlier mdstat "recovery" message and the "check" message:
# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 dm-2[2] dm-0[3] dm-1[1]
      309764096 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
      [===>.................]  check = 17.1% (26578480/154882048) finish=121.1min speed=17642K/sec
      bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
Again, compared to both ZFS and Btrfs this took a very long time.
mdadm - Data corruption during file transfer
Time to simulate disk corruption in the middle of a file transfer from the client. I have removed the "zoo.mkv" file, and while the rsync command is running I will use the dd command multiple times on the mdadm machine against one of the drives:
# dd if=/dev/urandom of=/dev/mapper/sdb seek=3000 count=30 bs=1k
UPDATE 2019-08-27: This is where I made the mistake. Writing to /dev/mapper/sdb also updates the dm-integrity checksum. The correct test would have been to write random data to /dev/sdb.
At first I misunderstood the behavior of dm-integrity and I was expecting to see something like this in the log:
device-mapper: INTEGRITY AEAD ERROR, sector 39784
But I didn't.
According to the dm-integrity documentation I should be seeing an integrity failure count, but that would of course require that I was hitting a sector that was being read, which wasn't the case in my example.
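That counter can be read without waiting for a scrub. Per the kernel's dm-integrity documentation, the first target-specific field of the "dmsetup status" line is the number of failed checksum verifications; the exact field position below is my reading of that documentation, so verify it against your kernel before relying on it:

```shell
# Sketch: extract the dm-integrity mismatch counter from a dmsetup
# status line ("start length integrity <mismatches> ..."), i.e. the
# 4th whitespace-separated field.
integrity_mismatches() {
    echo "$1" | awk '{print $4}'
}

# On a real system, loop over the opened devices:
if command -v dmsetup >/dev/null 2>&1; then
    for dev in sdb sdc sdd; do
        printf '%s: %s mismatches\n' "$dev" \
            "$(integrity_mismatches "$(dmsetup status "$dev" 2>/dev/null)")"
    done
fi
```

The counter only moves when a bad sector is actually read, which is exactly why my dd writes went unnoticed until something touched those sectors.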
So I need to run a scrub:
# echo check > /sys/block/md127/md/sync_action
Mismatching checksums are now found:
# dmesg --follow
[  980.626850] md: data-check of RAID pool md127
[  982.490908] md127: mismatch sector in range 38120-38128
[  982.490911] md127: mismatch sector in range 38128-38136
[  982.490913] md127: mismatch sector in range 38144-38152
[  982.490914] md127: mismatch sector in range 38152-38160
[  982.490916] md127: mismatch sector in range 38160-38168
[  982.490917] md127: mismatch sector in range 38168-38176
[  982.490918] md127: mismatch sector in range 38048-38056
[  982.490919] md127: mismatch sector in range 38056-38064
[  982.490922] md127: mismatch sector in range 38064-38072
[  982.490923] md127: mismatch sector in range 38072-38080
# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 dm-2[1] dm-1[2] dm-0[3]
      309764096 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
      [=>...................]  check =  7.0% (10876736/154882048) finish=139.0min speed=17257K/sec
      bitmap: 1/2 pages [4KB], 65536KB chunk

unused devices: <none>
And after a very long time mdadm has fixed the pool:
# dmesg
[13943.389533] md: md127: data-check done.
# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 dm-2[1] dm-1[2] dm-0[3]
      309764096 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
      bitmap: 0/2 pages [0KB], 65536KB chunk

unused devices: <none>
# mdadm --misc -D /dev/md/raid5
/dev/md/raid5:
           Version : 1.2
     Creation Time : Sat Apr 27 19:15:28 2019
        Raid Level : raid5
        Array Size : 309764096 (295.41 GiB 317.20 GB)
     Used Dev Size : 154882048 (147.71 GiB 158.60 GB)
      Raid Devices : 3
     Total Devices : 3
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Tue Apr 30 03:42:55 2019
             State : clean
    Active Devices : 3
   Working Devices : 3
    Failed Devices : 0
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : testbox:raid5  (local to host testbox)
              UUID : c6183471:4e732124:11d110d8:1acd5593
            Events : 4454

    Number   Major   Minor   RaidDevice State
       3     254        0        0      active sync   /dev/dm-0
       1     254        2        1      active sync   /dev/dm-2
       2     254        1        2      active sync   /dev/dm-1
I can mount the pool on the client and access the files:
$ ls -gG tmp
total 47835856
-rwxr-xr-x 1    18576345 Apr 21 09:08 1.pdf
-rwxr-xr-x 1    30255102 Apr 21 09:08 2.pdf
-rwxr-xr-x 1    22016195 Apr 21 09:08 3.pdf
-rwxr-xr-x 1 35456180485 Apr 21 07:58 bar.mkv
-rwxr-xr-x 1   625338368 Mar  5  2018 boo.iso
-rwxr-xr-x 1  1548841922 Apr 15 23:50 foo.mkv
-rwxr-xr-x 1   415633408 Mar  5  2018 moo.iso
-rwxr-xr-x 1 10867033488 Apr 22 21:10 zoo.mkv
mdadm - The dd mistake
Time to see what happens if the sysadmin issues the dd command on one of the RAID drives by mistake. Again I have removed the "zoo.mkv" file, and I am doing this during an active file transfer:
# dd if=/dev/urandom of=/dev/mapper/sdb bs=1M
^C89745+0 records in
89745+0 records out
91898880 bytes (92 MB, 88 MiB) copied, 26.8636 s, 3.4 MB/s
UPDATE 2019-08-27: This is where I made the mistake again. Writing to /dev/mapper/sdb also updates the dm-integrity checksum. The correct test would have been to write random data to /dev/sdb.
Not as much data overwritten as in the tests with ZFS or Btrfs, but this should still suffice.
Nothing noticeable happened on the client. The file transfer keeps going:
$ rsync -a --progress --stats /home/naim/tmp/ /home/naim/
sending incremental file list
./
zoo.mkv
 10,223,976,448  94%   31.22MB/s    0:00:20
At this point in the tests ZFS had already detected checksum errors and reported "One or more devices has experienced an unrecoverable error". With mdadm there isn't really anything to go by until you run a scrub:
# echo check > /sys/block/md127/md/sync_action
# dmesg --follow
[18709.603919] md: data-check of RAID pool md127
[18710.296237] md127: mismatch sector in range 152-160
[18710.296240] md127: mismatch sector in range 144-152
[18710.296242] md127: mismatch sector in range 136-144
[18710.296243] md127: mismatch sector in range 128-136
[18710.296245] md127: mismatch sector in range 120-128
[18710.296248] md127: mismatch sector in range 112-120
[18710.296249] md127: mismatch sector in range 104-112
[18710.296251] md127: mismatch sector in range 96-104
[18710.296261] md127: mismatch sector in range 88-96
[18710.296263] md127: mismatch sector in range 80-88
[18715.299379] handle_parity_checks5: 25825 callbacks suppressed
[18715.299381] md127: mismatch sector in range 195112-195120
[18715.299652] md127: mismatch sector in range 195096-195104
[18715.299836] md127: mismatch sector in range 195176-195184
[18715.299890] md127: mismatch sector in range 195168-195176
[18715.299945] md127: mismatch sector in range 195152-195160
[18715.300001] md127: mismatch sector in range 195144-195152
[18715.300048] md127: mismatch sector in range 195192-195200
[18715.300106] md127: mismatch sector in range 195160-195168
[18715.300143] md127: mismatch sector in range 195136-195144
[18715.300180] md127: mismatch sector in range 195184-195192
[18720.302862] handle_parity_checks5: 24559 callbacks suppressed
...
# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 dm-2[1] dm-1[2] dm-0[3]
      309764096 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
      [>....................]  check =  3.0% (4786740/154882048) finish=125.7min speed=19891K/sec
      bitmap: 0/2 pages [0KB], 65536KB chunk

unused devices: <none>
After the scrub is finally done:
# dmesg
[31789.569052] md: md127: data-check done
# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 dm-2[1] dm-1[2] dm-0[3]
      309764096 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
      bitmap: 0/2 pages [0KB], 65536KB chunk

unused devices: <none>
# mdadm --misc -D /dev/md/raid5
/dev/md/raid5:
           Version : 1.2
     Creation Time : Sat Apr 27 19:15:28 2019
        Raid Level : raid5
        Array Size : 309764096 (295.41 GiB 317.20 GB)
     Used Dev Size : 154882048 (147.71 GiB 158.60 GB)
      Raid Devices : 3
     Total Devices : 3
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Mon Apr 29 00:41:18 2019
             State : clean
    Active Devices : 3
   Working Devices : 3
    Failed Devices : 0
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : testbox:raid5  (local to host testbox)
              UUID : c6183471:4e732124:11d110d8:1acd5593
            Events : 3457

    Number   Major   Minor   RaidDevice State
       3     254        0        0      active sync   /dev/dm-0
       1     254        2        1      active sync   /dev/dm-2
       2     254        1        2      active sync   /dev/dm-1
I did this test multiple times on the mdadm system. The second time everything was fixed at this point: the pool continued to run and I could mount and use the filesystem. But the first time things didn't go as smoothly! I got mount errors:
# mount /dev/md/raid5 /pub/
mount: /pub: wrong fs type, bad option, bad superblock on /dev/md127, missing codepage or helper program, or other error.
I tried rebooting the machine, re-opening the drives, and then assembling, but the problem persisted.
# mdadm --stop /dev/md/raid5
mdadm: stopped /dev/md/raid5
# mdadm --assemble --scan
mdadm: /dev/md/raid5 has been started with 3 drives.
# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : active raid5 dm-0[3] dm-1[2] dm-2[1]
      309764096 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
      bitmap: 0/2 pages [0KB], 65536KB chunk

unused devices: <none>
# mount /dev/md/raid5 /pub/
mount: /pub: wrong fs type, bad option, bad superblock on /dev/md127, missing codepage or helper program, or other error.
The drives are all registered as clean:
# mdadm --examine /dev/mapper/sdb /dev/mapper/sdc /dev/mapper/sdd
/dev/mapper/sdb:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : c6183471:4e732124:11d110d8:1acd5593
           Name : testbox:raid5  (local to host testbox)
  Creation Time : Sat Apr 27 19:15:28 2019
     Raid Level : raid5
   Raid Devices : 3

 Avail Dev Size : 309764392 (147.71 GiB 158.60 GB)
     Array Size : 309764096 (295.41 GiB 317.20 GB)
  Used Dev Size : 309764096 (147.71 GiB 158.60 GB)
    Data Offset : 264192 sectors
   Super Offset : 8 sectors
   Unused Space : before=264112 sectors, after=296 sectors
          State : clean
    Device UUID : 84c226cb:7d10c97d:8820a12f:24e28510

Internal Bitmap : 8 sectors from superblock
    Update Time : Mon Apr 29 00:41:18 2019
  Bad Block Log : 512 entries available at offset 16 sectors
       Checksum : e4546d90 - correct
         Events : 3457

         Layout : left-symmetric
     Chunk Size : 512K

    Device Role : Active device 0
    Array State : AAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/mapper/sdc:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : c6183471:4e732124:11d110d8:1acd5593
           Name : testbox:raid5  (local to host testbox)
  Creation Time : Sat Apr 27 19:15:28 2019
     Raid Level : raid5
   Raid Devices : 3

 Avail Dev Size : 309764392 (147.71 GiB 158.60 GB)
     Array Size : 309764096 (295.41 GiB 317.20 GB)
  Used Dev Size : 309764096 (147.71 GiB 158.60 GB)
    Data Offset : 264192 sectors
   Super Offset : 8 sectors
   Unused Space : before=264112 sectors, after=296 sectors
          State : clean
    Device UUID : 924b5275:8d6bee5c:1d049510:60f6103f

Internal Bitmap : 8 sectors from superblock
    Update Time : Mon Apr 29 00:41:18 2019
  Bad Block Log : 512 entries available at offset 16 sectors
       Checksum : 7d24497d - correct
         Events : 3457

         Layout : left-symmetric
     Chunk Size : 512K

    Device Role : Active device 1
    Array State : AAA ('A' == active, '.' == missing, 'R' == replacing)
/dev/mapper/sdd:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : c6183471:4e732124:11d110d8:1acd5593
           Name : testbox:raid5  (local to host testbox)
  Creation Time : Sat Apr 27 19:15:28 2019
     Raid Level : raid5
   Raid Devices : 3

 Avail Dev Size : 309764392 (147.71 GiB 158.60 GB)
     Array Size : 309764096 (295.41 GiB 317.20 GB)
  Used Dev Size : 309764096 (147.71 GiB 158.60 GB)
    Data Offset : 264192 sectors
   Super Offset : 8 sectors
   Unused Space : before=264112 sectors, after=296 sectors
          State : clean
    Device UUID : 121eecb0:459a731c:9a45a4e9:b42b46d7

Internal Bitmap : 8 sectors from superblock
    Update Time : Mon Apr 29 00:41:18 2019
  Bad Block Log : 512 entries available at offset 16 sectors
       Checksum : e987c188 - correct
         Events : 3457

         Layout : left-symmetric
     Chunk Size : 512K

    Device Role : Active device 2
    Array State : AAA ('A' == active, '.' == missing, 'R' == replacing)
This is clearly a filesystem error and it seems like I might be dealing with The Bad Block Controversy.
From the documentation I get the following options:
The first thing is to try and check the integrity of your file system. A command like "tar cf / > /dev/null" will read the entire file system and tell you if any files are unreadable. It should also clear any Bad Blocks that have data on them but are recoverable. However, this is a known bug - that doesn't always happen.
But the Bad Blocks may be on an unallocated portion of the file system. If you wish to clear that, try a command like "cat /dev/zero > /tempfile &; rm /tempfile". This will fill all your spare disk space with zeroes, then delete the file it used to do so.
After both these things have been done, your Bad Blocks list should be empty. However, both these commands are very disk-heavy, and will take a very long time on a modern pool. Plus the code is strongly suspected to be buggy so these commands could very likely not work.
If you are satisfied that everything is okay, and you don't want the Bad Blocks functionality, the easy way to get rid of it (if you have no Bad Blocks list to clear) is "mdadm ... --assemble --update=no-bbl".
If, however, you do have an active Bad Blocks list with sectors in it, this command won't work. You can use the command "mdadm ... --assemble --update=force-no-bbl" to delete the list, but this will now mean that mdadm will probably return garbage where before it failed with an error. If you're satisfied that your file system is intact, though, this won't matter to you.
In my specific case none of the above are going to work and the only way forward is to actually clean the ext4 filesystem and hope for the best:
# fsck /dev/md/raid5
...
Inode 523773 seems to contain garbage.  Clear? yes
Inode 523774 seems to contain garbage.  Clear? yes
Inode 523775 seems to contain garbage.  Clear? yes
...
# top
%CPU %MEM     TIME+ COMMAND
23.3  0.1   0:18.20 fsck.ext4
 3.3  0.0   0:00.58 kworker/0:28-dm-integrity-metadata
 3.0  0.0   0:01.51 kworker/1:5-dm-integrity-metadata
 3.0  0.0   0:00.66 kworker/0:25-dm-integrity-metadata
 2.7  0.0   0:01.34 kworker/0:6-dm-integrity-metadata
 2.7  0.0   0:00.80 kworker/1:19-dm-integrity-metadata
 2.7  0.0   0:00.75 kworker/1:34+dm-integrity-metadata
...

/dev/md127: ***** FILE SYSTEM WAS MODIFIED *****
/dev/md127: 21/19365888 files (14.3% non-contiguous), 13453202/77441024 blocks
Tons and tons of errors. I will now mount the pool and check the files:
# mount /dev/md/raid5 /pub
# cd /pub
# ls -a
.  ..  lost+found
The tmp directory is gone!
# cd lost+found
# du -h
4.0K    ./#11
46G     ./#12058625
# cd \#12058625/
# ls -la
total 47835856
-rwxrw-r-- 1    18576345 Apr 21 09:08 1.pdf
-rwxrw-r-- 1    30255102 Apr 21 09:08 2.pdf
-rwxrw-r-- 1    22016195 Apr 21 09:08 3.pdf
-rwxrw-r-- 1 35456180485 Apr 21 07:58 bar.mkv
-rwxrw-r-- 1   625338368 Mar  5  2018 boo.iso
-rwxrw-r-- 1  1548841922 Apr 15 23:50 foo.mkv
-rwxrw-r-- 1   415633408 Mar  5  2018 moo.iso
-rwxrw-r-- 1 10867033488 Apr 22 21:10 zoo.mkv
However, the files are still there.
I decided to stop testing mdadm+dm-integrity here, as the last test, the failure of a second drive during restoration, would be rather pointless and very time consuming.
A correct test of mdadm+dm-integrity
As mentioned in the beginning, when I tested for data integrity errors I wrote to /dev/mapper/sdb, which also updates the dm-integrity checksum. Later, when I ran the sync_action check, the errors displayed weren't dm-integrity checksum errors but rather RAID-5 parity errors. The correct test would have been to write random data to /dev/sdb.
The test below was sent to me by Philip, who was kind enough to write to me about the mistake I made and to provide an example of a test where the random data is written to the correct device.
# integritysetup format --integrity sha256 /dev/nvme0n1
# integritysetup format --integrity sha256 /dev/nvme0n2
# integritysetup format --integrity sha256 /dev/nvme0n3
# integritysetup open --integrity sha256 /dev/nvme0n1 nvme0n1
# integritysetup open --integrity sha256 /dev/nvme0n2 nvme0n2
# integritysetup open --integrity sha256 /dev/nvme0n3 nvme0n3
# mdadm --create --verbose --level=5 --raid-devices=3 /dev/md0 /dev/mapper/nvme0n1 /dev/mapper/nvme0n2 /dev/mapper/nvme0n3
# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 dm-6[3] dm-5[1] dm-4[0]
      8253440 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]

unused devices: <none>
# mkfs.xfs /dev/md0
# mount /dev/md0 /mnt/md0
# dd if=/dev/zero of=/mnt/md0/test.data bs=1G count=7
7+0 records in
7+0 records out
7516192768 bytes (7.5 GB, 7.0 GiB) copied, 156.89 s, 47.9 MB/s
# sha256sum test.data
5e002ad6567bbdc9d43cc140cc509a592838457a56df571cd203230c7a56f241  test.data
# dd if=/dev/urandom of=/dev/nvme0n1 seek=3000 bs=1k
^C641763+0 records in
641762+0 records out
657164288 bytes (657 MB, 627 MiB) copied, 20.8391 s, 31.5 MB/s
# sha256sum test.data
5e002ad6567bbdc9d43cc140cc509a592838457a56df571cd203230c7a56f241  test.data
...
Aug 26 23:03:16 localhost.localdomain kernel: device-mapper: integrity: Checksum failed at sector 0x12faf0
Aug 26 23:03:16 localhost.localdomain kernel: device-mapper: integrity: Checksum failed at sector 0x12faf8
Aug 26 23:03:16 localhost.localdomain kernel: device-mapper: integrity: Checksum failed at sector 0x12fb00
Aug 26 23:03:16 localhost.localdomain kernel: device-mapper: integrity: Checksum failed at sector 0x12fb08
Aug 26 23:03:16 localhost.localdomain kernel: device-mapper: integrity: Checksum failed at sector 0x12fb10
Aug 26 23:03:16 localhost.localdomain kernel: device-mapper: integrity: Checksum failed at sector 0x12fb18
Aug 26 23:03:16 localhost.localdomain kernel: device-mapper: integrity: Checksum failed at sector 0x12fb20
Aug 26 23:03:16 localhost.localdomain kernel: device-mapper: integrity: Checksum failed at sector 0x12fb28
Aug 26 23:03:16 localhost.localdomain kernel: device-mapper: integrity: Checksum failed at sector 0x12fb30
Aug 26 23:03:16 localhost.localdomain kernel: device-mapper: integrity: Checksum failed at sector 0x12fb38
Aug 26 23:03:16 localhost.localdomain kernel: device-mapper: integrity: Checksum failed at sector 0x12fb40
Aug 26 23:03:16 localhost.localdomain kernel: device-mapper: integrity: Checksum failed at sector 0x12fb48
...
A second run of the checksum verification produces no dm-integrity error output:

# sha256sum test.data
5e002ad6567bbdc9d43cc140cc509a592838457a56df571cd203230c7a56f241  test.data
Thank you very much for sharing this, Philip!
Final notes
With mdadm you can set up any kind of RAID pool for disk redundancy, and you can even expand the pool with further disks if needed. You can then add dm-integrity for error detection and error correction at the block level. If you add dm-crypt+LUKS to that, the data integrity is protected with native authenticated encryption. And if you use the RAID pool as an LVM physical volume, you can do snapshots too. But it cannot by a long shot compare to ZFS or Btrfs!
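For reference, a minimal sketch of what building that whole stack looks like. The device names are placeholders, the sizes are made up, and this is only meant to show how many separate layers you have to assemble and understand, compared to a single zpool create:

```shell
# 1. Per-disk dm-integrity metadata, for silent-corruption detection:
integritysetup format /dev/sdb
integritysetup open /dev/sdb int-sdb
# ...repeat for /dev/sdc and /dev/sdd...

# 2. RAID-5 across the integrity devices, for redundancy:
mdadm --create /dev/md0 --level=5 --raid-devices=3 \
    /dev/mapper/int-sdb /dev/mapper/int-sdc /dev/mapper/int-sdd

# 3. Optional LUKS2 on top; with --integrity hmac-sha256 the
# encryption is authenticated rather than just confidential:
cryptsetup luksFormat --type luks2 /dev/md0
cryptsetup open /dev/md0 cryptpool

# 4. LVM on top of that, for snapshots:
pvcreate /dev/mapper/cryptpool
vgcreate pool /dev/mapper/cryptpool
lvcreate -n data -L 100G pool
lvcreate -s -n data-snap -L 10G pool/data
```

Every one of these layers has its own failure modes, its own documentation, and its own recovery procedures, which is exactly the point made above.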
And I would never use something like mdadm+dm-integrity as a replacement for ZFS or Btrfs.
It is true that you don't have to worry about compiling a kernel module every time the kernel is updated (as with ZFS on Linux), about using annoying Solaris compatibility layers, or about running with an external, unofficial and often out-of-sync repository (Arch Linux developers, yes I'm talking to you! We need ZFS in the official Arch Linux repositories ASAP! Even Debian has ZFS in its repositories!). And you don't need to worry about any "write hole" issues as with Btrfs.
However, not only do you need to carefully study the documentation of each piece of technology you put together with mdadm, understand how the pieces fit together, and know how best to deal with potential problems, but you're also still limited by the "regular" filesystem you put on top of it all, and you don't get any of the really well designed and superior protection or management that ZFS or Btrfs provides.
Let me quote Allan Jude and Michael W. Lucas from their book "FreeBSD Mastery: ZFS":
ZFS is merely a filesystem, yes. But it includes features that many filesystems can't even contemplate.
ZFS is a copy-on-write filesystem that is extremely well designed, and it is light years ahead of Btrfs. ZFS is also very easy to use. Yes, you are allowed to shoot yourself in the foot with ZFS, this is *NIX after all, and if you don't plan ahead you can end up with a big mess, but then it is mostly your own fault. ZFS is very well documented, yet with ZFS you almost know by intuition how a command needs to be constructed.
Another big advantage of ZFS is that it is extremely reliable and very well battle tested in the industry. You'll find thousands of companies and regular people that have been using ZFS for a very long time, which means it is much easier to get help if you need it.
Btrfs is also a copy-on-write filesystem, and I believe it was supposed to be the "better ZFS" on Linux, but even though companies like Facebook have deployed Btrfs on millions of servers, they only care about the specific functionality they need, which means that the RAID5/6 "write hole" issue still isn't fixed even after so many years!
Another concern with Btrfs is that many people have reported serious data-loss issues with it, and the Debian Linux Wiki contains many relevant and unresolved issues (as of writing) to be alarmed about.
Some such reports (not the ones on the Debian wiki) are the result of a mismanaged situation. Very often people deploy solutions without studying and without testing. Then when errors happen they manage to blow everything up (by the trial-and-error approach), make things unrecoverable, and then blame the technology.
But many people - as in this example - have also reported only positive results with Btrfs. OpenSUSE, a Linux distribution sponsored by SUSE Linux GmbH and several other companies, runs with Btrfs as the default filesystem for the root partition. And Synology, a company founded in 2000 that creates network-attached storage (NAS), IP surveillance solutions, and network equipment, bases its NAS storage solutions on the Btrfs filesystem.
Also, even though ZFS is as mature as it is, it is still undergoing rapid and active development, with many new features continuously added. It has received a huge amount of work since the cooperation between all the different projects in OpenZFS, and especially from the ZFS on Linux project. So much so that the FreeBSD developers decided to re-base their ZFS filesystem code on the "ZFS on Linux" port rather than on the Illumos code from which they had originally been acquiring it. This also means that while ZFS is getting new features added, it is also experiencing new bugs and issues.
Personally I really love ZFS and it is without a doubt my favorite filesystem! It's an absolutely amazing feat of engineering. But I also very much hope that Btrfs eventually catches up, as it has improved a lot lately, and I don't think it deserves all the bad press it has gotten. I have deployed Btrfs in multiple setups without any issues and it has performed really well too.
Update 2020-01-23: Since writing this article I have abandoned Btrfs completely. I have found no situation where it was viable to run Btrfs rather than ZFS, whether on FreeBSD or even on GNU/Linux.
Anyway, I hope you have found this article worth the read.
Relevant reading
- ZFS on Wikipedia
- ORACLE ZFS documentation
- FreeBSD Handbook ZFS Chapter
- FreeBSD Mastery: ZFS
- FreeBSD Mastery: Advanced ZFS
- OpenZFS
- ZFS on Linux
- ZFS on Arch Linux Wiki
- ZFS on Gentoo Linux Wiki
- Btrfs on Wikipedia
- Btrfs on Debian Linux Wiki
- Btrfs on Arch Linux Wiki
- Btrfs on Gentoo Linux Wiki
- Btrfs on OpenSUSE Wiki
- Btrfs FAQ
- Linux RAID
- mdadm on Arch Linux Wiki
- dm-integrity