Adding Metadata Persistence for a Storage Cluster
In recent months, I’ve worked on adding metadata persistence to an internal project. Leaving out the details of the code, I’d like to review the methods I used for testing and debugging. Systems programming is intricate: hours of searching can come down to a single line of code that causes the error, and the kernel back traces may point far away from the original cause. Or perhaps that’s just me, making silly mistakes like mismatching units or freeing the wrong pointer. One kind of mistake is especially easy to overlook: the one that I think is correct.
The goal is to persist the metadata of many servers. In the design phase of that task, I had many meetings with my manager to discuss the what, how, and why. We looked into two open-source projects for inspiration, /dev/md and drbd. I’ll briefly discuss some of the topics I found interesting.
Design
In the very beginning, we were thinking about where and what to store, the state machine, the cache policy, the organization/layout of the metadata, and the failures to handle. Above all, we wanted strong consistency on the metadata and to handle the split-brain case, where a network partition splits communication into two halves.
Availability and redundancy are important components of system design. A request on the VM is not complete if it failed on some servers; if a server is down while a request is being submitted, the availability of the system decreases. When a machine crashes, and every server stores one copy on that machine, the redundancy of the system also goes down by one.
Flush methods
- Does (10 writes of 4 KB + 1000 writes of 32 B) or (1010 writes of 4 KB + 32 B) perform better when written back to disk?
  If we break the space into two parts, one of them is updated immediately while the other waits for a certain time window. The answer is that the difference in performance is not that big: both variants issue the same number of IOs, and at these sizes the per-IO overhead dominates the extra bytes transferred. The interface of this project is designed to pass the data buffer on to the block layer; the argument to decide is the transfer size of the md IOs.
- (one write IO + one md IO) or (several write IOs + one md IO)?
- When to flush? Drbd syncs the activity log prior to the failure point; /dev/md syncs once per time window. The time window for flushing metadata bounds how much consistency the project can preserve.
  Drbd flushes a certain (hot) area. The activity log (AL) tracks the data blocks that were recently written to. To ensure a quick recovery, the activity log gets synced first; the data blocks that expire from the AL (recorded in the bitmap) get synced afterwards.
  When data is written and the block is no longer in the AL, it is marked in the bitmap as requiring synchronization. Drbd has primary and backup nodes and performs synchronization in the background in the event of a node failure. Synchronization here means getting synced from another node that is in a consistent state.
Synchronization policy
With a little understanding of the context, we can see that the key to metadata persistence is synchronization.
(v8.1) Drbd uses timers to periodically run regular jobs such as resyncs and writes. The metadata is stored either at the end of the device or in a separate file.
- When it makes a resync request, it checks whether application IOs are still in flight on this area and lets them complete first.
- The data blocks tracked in the activity log can be transferred to the bitmap to wait for synchronization.
- When the metadata is marked dirty, it runs md_sync to flush the metadata area to disk (a sketch of this pattern follows the list).
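As a rough, hypothetical sketch of this kind of timer-driven flush (not drbd’s actual implementation; the names md_state, md_mark_dirty, and the one-second window are made up for illustration), a dirty flag plus a delayed work item is often enough:

```c
#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/jiffies.h>
#include <linux/atomic.h>

/* Hypothetical per-device metadata state; all names are made up. */
struct md_state {
	atomic_t dirty;                 /* set whenever in-memory metadata changes */
	struct delayed_work flush_work; /* periodic flush, like drbd's timers      */
};

static struct md_state md;

static void md_write_to_disk(void)
{
	/* placeholder: serialize the superblock/AL and write it out,
	 * e.g. via submit_bio_wait(); omitted here */
}

static void md_flush_fn(struct work_struct *work)
{
	/* flush only if something changed since the last run */
	if (atomic_xchg(&md.dirty, 0))
		md_write_to_disk();

	/* re-arm: one flush window per second in this sketch */
	schedule_delayed_work(&md.flush_work, HZ);
}

/* Writers call this; the actual disk IO happens later in the worker. */
void md_mark_dirty(void)
{
	atomic_set(&md.dirty, 1);
}

static int __init md_demo_init(void)
{
	INIT_DELAYED_WORK(&md.flush_work, md_flush_fn);
	schedule_delayed_work(&md.flush_work, HZ);
	return 0;
}

static void __exit md_demo_exit(void)
{
	cancel_delayed_work_sync(&md.flush_work);
}

module_init(md_demo_init);
module_exit(md_demo_exit);
MODULE_LICENSE("GPL");
```

The window length is the knob from the flush-methods discussion above: a longer window means fewer metadata IOs, but more state that can be lost on a crash.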
(v9.1)
- Why do more active extents in the AL reduce metadata writes?
  Active extents are regions in the AL that are currently being written to; each active extent tracks a range of blocks. The AL uses active extents to track recent writes in memory. The more active extents there are, the fewer extents have to be written out to disk as they go inactive, so metadata writes are reduced. An active extent becomes inactive when no new IOs land in it for a period of time; an old on-disk extent gets activated again in failure cases.
- How do we address this issue?
  Drbd periodically flushes the inactive extents to disk. If the secondary machine crashes, the AL is lost along with its active extents. During data recovery it then has to activate the old extents that are on disk, but the data of the active extents that had not yet been flushed to disk is lost.
  Data loss window: there is an inherent risk of losing the most recent writes that were only tracked in the in-memory AL and not yet flushed to disk. The toy model after this list illustrates how the AL and the bitmap interact.
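To make this concrete, here is a deliberately simplified, user-space toy model (every name and size is invented; real drbd uses an LRU of AL extents and on-disk AL transactions). A working set larger than the number of AL slots keeps evicting extents and therefore keeps generating metadata writes; a larger AL absorbs the same workload with far fewer writes:

```c
#include <stdio.h>
#include <stdbool.h>
#include <string.h>

/* Toy model of AL-style write tracking:
 * - the activity log (AL) holds a few "active extents" in memory;
 * - activating a cold extent costs an on-disk AL transaction;
 * - evicting an extent additionally records it in the on-disk bitmap. */
#define AL_SLOTS 4
#define EXTENTS  64

static int  al[AL_SLOTS];     /* in-memory active extents, -1 = empty slot */
static bool bitmap[EXTENTS];  /* "on-disk" dirty bitmap                    */
static int  metadata_writes;

static void al_insert(int extent)
{
	for (int i = 0; i < AL_SLOTS; i++)
		if (al[i] == extent)
			return;            /* still hot: no metadata IO at all */

	if (al[0] != -1) {                 /* evict the oldest extent */
		bitmap[al[0]] = true;
		metadata_writes++;         /* bitmap update hits disk */
	}
	memmove(al, al + 1, (AL_SLOTS - 1) * sizeof(al[0]));
	al[AL_SLOTS - 1] = extent;
	metadata_writes++;                 /* AL transaction hits disk */
}

int main(void)
{
	/* a working set of 5 extents thrashing a 4-slot AL */
	int workload[] = { 1, 2, 3, 4, 5, 1, 2, 3, 4, 5 };

	memset(al, -1, sizeof(al));
	for (size_t i = 0; i < sizeof(workload) / sizeof(workload[0]); i++)
		al_insert(workload[i]);

	/* prints 16 with AL_SLOTS = 4; with AL_SLOTS = 5 it would be just 5 */
	printf("metadata writes: %d\n", metadata_writes);
	return 0;
}
```

The data loss window is exactly the dirty state that exists only in al[] (the in-memory AL) at crash time.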
Fancy data structures?
What data structures could be used for this task? Should we consider a B+ tree or a log-structured merge tree? At the current stage, simple structures are enough.
Development
Tips:
- Avoid premature optimization: in the first version we can omit edge cases (failure cases) that are not sorted out yet. That also speeds up development.
- Push development along until the regression tests pass; do the cleanup work only after they pass.
The first thing I did after most of the code was done was to test for basic correctness: change some state and see whether it persists. Then I cleaned up the code, which turned out to be premature, since a few errors showed up once I ran the regression tests.
Debugging
The process of debugging also strengthens my understanding of the project and trains my debugging skills by finding the root cause of every error. Say the chain of clues is A->B->C->D->E: it’s easy to track down A when the back traces are complete, but our job is to track down A even without B, C, and D. When I have no clue about an issue, a tip a friend once taught me works well: binary search debugging. Knowing that the project was bug-free before commit c1, break the remaining changes into two parts and test each part; iterate until the root cause is identified (essentially what git bisect automates at commit granularity). It doesn’t have to be exactly in half; just pick out the suspicious part and run the tests. With a little patience, we find the reason eventually.
Reference count
percpu ref (xxxxx) <= 0 (-15) after switching to atomic.
Reference counting is used by the kernel to know when a data structure is unused and can be disposed of. Most of the time, reference counts are represented by an atomic_t variable, perhaps wrapped by a structure like a kref. If references are added and removed frequently over an object’s lifetime, though, that atomic_t variable can become a performance bottleneck. [1]
The core idea is to have a counter which is incremented whenever a new reference is taken and decremented when a reference is released. When this counter reaches zero, any resources used by the object (such as the memory used to store it) can be freed.
Reference count issues are usually caused by mismatched get_ref() and put_ref() calls, and can be traced back to one extra operation or to a race condition. First, reproduce the issue and add printing of the reference count. Second, check the code containing the reference operations for the extra operation: it could be one get_ref() matched by two put_ref() calls somewhere, for example two duplicate return paths that both call put_ref() while the input path called get_ref() only once. As I learned from my manager, all of this has to be checked by reading the code, doing dry runs, and building up the sequence of execution steps on paper or in one's head.
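As a minimal sketch of the balanced pattern (the structure and function names are hypothetical, not the project’s actual code): take the reference once on entry and funnel every return path through a single exit, so there is exactly one kref_put() per kref_get():

```c
#include <linux/kref.h>
#include <linux/slab.h>
#include <linux/errno.h>

/* Hypothetical object, for illustration only. */
struct md_buf {
	struct kref ref;
	void *data;
};

static void md_buf_release(struct kref *kref)
{
	struct md_buf *buf = container_of(kref, struct md_buf, ref);

	kfree(buf->data);
	kfree(buf);
}

/* One kref_get() on entry, exactly one kref_put() on every exit:
 * funneling all return paths through a single label makes the
 * "two puts on duplicate return paths" mistake much harder to write. */
static int md_buf_process(struct md_buf *buf)
{
	int err = 0;

	kref_get(&buf->ref);

	if (!buf->data) {
		err = -EINVAL;
		goto out;
	}

	/* ... use buf->data ... */

out:
	kref_put(&buf->ref, md_buf_release);
	return err;
}
```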
Fio_verify: bad header
verify: bad header offset 366080, wanted 361984 at file /dev/xxx offset 361984, length 512
Fio_verify checks whether the original input was stored properly on the storage. The server may hold partial data; if something shifts the offset of the IOs, the data is not stored at the intended location, which leads to the bad header error from fio_verify.
Work_queue mechanism
One case is a work item firing after the point where the work queue should have been destroyed. The first step is to check whether the work actually gets canceled; once that is certain, we can check the conditions under which the work queue is canceled and reason about the issue.
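For context, this is the usual teardown ordering in kernel code (a generic sketch with made-up names, not the project’s code): cancel any pending work before destroying the queue, otherwise a late work item can run against freed state:

```c
#include <linux/workqueue.h>
#include <linux/jiffies.h>
#include <linux/errno.h>

static struct workqueue_struct *md_wq;
static struct delayed_work md_work;

static void md_work_fn(struct work_struct *work)
{
	/* ... periodic metadata work ... */
}

static int md_wq_setup(void)
{
	md_wq = alloc_workqueue("md_demo", WQ_MEM_RECLAIM, 0);
	if (!md_wq)
		return -ENOMEM;

	INIT_DELAYED_WORK(&md_work, md_work_fn);
	queue_delayed_work(md_wq, &md_work, HZ);
	return 0;
}

static void md_wq_teardown(void)
{
	/* cancel (and wait for) pending work before destroying the queue;
	 * a work item that fires after this point would touch freed state */
	cancel_delayed_work_sync(&md_work);
	destroy_workqueue(md_wq);
}
```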
Submit_bio_wait err
[ 162.458257] brd_submit_bio: bv offset 2729 len 1367 sector size 512
This error was hard to solve because the offset and length of our IOs were correct and no explicit errors showed up. With the help of my manager, we found that the problem came from the test script, which used ram devices (brd) as the block backend. After switching to SCSI disks the kernel warning was gone, since the IO no longer goes through that path.
We still need to find out how bio submission works for /dev/sda; it’s likely handled by QEMU virtio-blk or SCSI pass-through and submitted to the host kernel.
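For reference, a synchronous metadata write via submit_bio_wait() looks roughly like the sketch below (the function name and the choice of writing one full page are illustrative; the bio_alloc() signature shown is the recent one, older kernels take (gfp, nr_vecs) plus bio_set_dev()). The point of the brd warning above is that the bvec offset and length must respect the device’s logical block size:

```c
#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/errno.h>

/* Synchronously write one page of metadata at the given sector. */
static int md_write_page(struct block_device *bdev, struct page *page,
			 sector_t sector)
{
	struct bio *bio;
	int err;

	bio = bio_alloc(bdev, 1, REQ_OP_WRITE | REQ_SYNC, GFP_KERNEL);
	if (!bio)
		return -ENOMEM;

	bio->bi_iter.bi_sector = sector;
	/* the bvec offset/len added here must be aligned to the device's
	 * logical block size -- the brd warning above is about exactly that */
	bio_add_page(bio, page, PAGE_SIZE, 0);

	err = submit_bio_wait(bio);
	bio_put(bio);
	return err;
}
```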
Reconnection timeout/crashes
The most difficult issue I met while debugging md persistence was a constant reconnection timeout that affected several tests. When one server hits a reconnection timeout and crashes shortly afterwards, it’s hard to track exactly what is going on.
The first thing is to find the back traces in the very short time window right before the server crashes. There were several back traces, but none of them seemed related to the md persistence part. This is the case where only E is visible while B, C, and D are missing. I made a few fixes to see whether the back traces would change, only to get new back traces suggesting that a module outside ours was triggering the kernel warnings, which made the issue even harder to reason about. I had been going around in circles on it, running a few tests and getting no useful results.
That was when I decided to apply binary search debugging. At first I left the md_sync part out and tested the rest, which passed. Then I knew the issue had to be inside the md_sync part, but that was still hundreds of lines. The md_sync code can be divided into four parts; by commenting out each part and testing separately, I finally found the offending part and looked into it line by line.
Even though I was sure it was part p that caused the issue, the connection from p to the symptom still looked far-fetched: the back traces showed nothing related to p, and the logic of p looked correct. Where could it go wrong? After staring at the code for a long time, it occurred to me that the buffer pointer had been advanced, and the advanced pointer was the one being freed. Case closed.
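Reduced to its essence, the mistake looks like this (hypothetical names; the same applies to malloc()/free() in user space):

```c
#include <linux/slab.h>

/* The bug class, reduced to a few lines. */
static int md_fill(size_t len)
{
	char *buf = kmalloc(len, GFP_KERNEL);
	char *p = buf;          /* cursor that gets advanced while filling */

	if (!buf)
		return -ENOMEM;

	/* ... write a header, advance the cursor ... */
	p += 16;

	kfree(buf);             /* correct: free what kmalloc() returned   */
	/* kfree(p);               the bug: freeing the advanced pointer    */
	return 0;
}
```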