Not Exactly About NVMe ZNS Emulation

This blog is the second part of the NVMe ZNS emulation series. You can find the first part here: A Tour of NVMe.

First, the tour continues by discussing the open questions and introducing the latest approach to metadata persistence. Second, we delve into address translation and elaborate on the details of ZNS metadata on the NVMe device side. Afterward, we will shift our focus away from NVMe and briefly explore Linux cluster storage.

There were several approaches to persistence when Dmitry contributed ZNS emulation[1]. The one I apply is to add a separate blockdev that stores the zoned metadata in the Qcow2 image. Qcow2 is preferable to the raw format because it is the more widely used virtual disk image format.

Another way is to use a separate hostmem/mmap'ed file. However, the QEMU block layer does not support mmap: a BlockDriverState might not have a file descriptor that can be mapped.

Qcow2 full emulation

Qcow2 is a storage format for virtual disks. Zoned storage emulation is added as a new format extension to the Qcow2 driver, so that users can attach such a Qcow2 image file to the guest OS as a zoned device.

The zone state machine depicts state transitions and zone resources[2]. The read-only and offline states are caused by device-internal events, which are ignored in full emulation for simplicity. The remaining zone states are empty, full, open, and closed. If the guest or QEMU crashes, the zone states are needed for recovery. In real devices, the open states (explicitly open, implicitly open) are lost and turn into the closed state after a power cycle. Meanwhile, write pointers must be preserved to track the usable blocks within each zone.

We need a way to maintain zone states correctly and guarantee zoned metadata persistence. The write pointers are kept as an array of unsigned 64-bit integers. Although a write pointer could also encode small pieces of zone information such as the zone state in its spare bits, keeping the zone state in memory serves the purpose well without the extra cost of packing state into the write pointers. To avoid losing track of active zones, Qcow2 keeps the open zones in doubly linked lists. When reading a zone's state, it first checks whether the zone is on one of the open-zone lists and, if it is found on neither, derives the state from the write pointer.

typedef struct Qcow2Wp {
    uint64_t wp;
    QLIST_ENTRY(Qcow2Wp) exp_open_zone_entry;
    QLIST_ENTRY(Qcow2Wp) imp_open_zone_entry;
} Qcow2Wp;

typedef struct BDRVQcow2State {
    ...
    /* States of zoned device */
    Qcow2ZonedHeaderExtension zoned_header;
    QLIST_HEAD(, Qcow2Wp) exp_open_zones;
    QLIST_HEAD(, Qcow2Wp) imp_open_zones;
    Qcow2Wp *wp;
    ...
} BDRVQcow2State;
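
As a rough sketch of how this in-memory metadata is consulted, the helper below derives a zone's state by first walking the two open-zone lists and then falling back to the write pointer. The helper name, the ZoneState enum, and the zone_size/zone_capacity header fields are illustrative assumptions, not the exact code that was merged.

/*
 * Sketch only: derive a zone's state from the in-memory metadata above.
 * The ZoneState enum and the zone_size/zone_capacity fields of the
 * zoned header extension are assumed names for illustration.
 */
typedef enum ZoneState {
    ZS_EMPTY,
    ZS_IMP_OPEN,
    ZS_EXP_OPEN,
    ZS_CLOSED,
    ZS_FULL,
} ZoneState;

static ZoneState qcow2_sketch_zone_state(BDRVQcow2State *s, uint32_t idx)
{
    Qcow2Wp *z = &s->wp[idx];
    Qcow2Wp *it;
    uint64_t zone_start = (uint64_t)idx * s->zoned_header.zone_size;

    /* Open zones are tracked explicitly in the two linked lists. */
    QLIST_FOREACH(it, &s->exp_open_zones, exp_open_zone_entry) {
        if (it == z) {
            return ZS_EXP_OPEN;
        }
    }
    QLIST_FOREACH(it, &s->imp_open_zones, imp_open_zone_entry) {
        if (it == z) {
            return ZS_IMP_OPEN;
        }
    }

    /* Otherwise the write pointer alone is enough to tell the state. */
    if (z->wp == zone_start) {
        return ZS_EMPTY;
    }
    if (z->wp >= zone_start + s->zoned_header.zone_capacity) {
        return ZS_FULL;
    }
    return ZS_CLOSED;
}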

Another kind of in-memory state is the zone resources: the open and active zone limits that constrain zone operations. Write requests to a zone are no longer executed once either limit is exceeded. If there is still room within the active zone limit, the device can implicitly close one implicitly open zone to avoid exceeding the open zone limit, as sketched below.
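
The resource check could look roughly like the following sketch. The nr_open_zones/nr_active_zones counters and the max_open_zones/max_active_zones header fields are assumed names, not necessarily the ones used in the actual patches.

/*
 * Sketch only: enforce the zone resource limits before opening a zone
 * for a write. Counter and field names are assumptions for illustration.
 */
static int qcow2_sketch_check_zone_resources(BDRVQcow2State *s)
{
    if (s->nr_active_zones >= s->zoned_header.max_active_zones) {
        /* The active zone limit is a hard stop: fail the write. */
        return -EBUSY;
    }

    if (s->nr_open_zones >= s->zoned_header.max_open_zones) {
        /*
         * There is still room for another active zone, so make room in
         * the open set by implicitly closing one implicitly open zone,
         * as a real ZNS device is allowed to do.
         */
        Qcow2Wp *victim = QLIST_FIRST(&s->imp_open_zones);
        if (!victim) {
            return -EBUSY;
        }
        QLIST_REMOVE(victim, imp_open_zone_entry);
        s->nr_open_zones--;
        /* The victim stays active (closed), so nr_active_zones is unchanged. */
    }

    return 0;
}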

ZNS metadata persistence

ZNS emulation is fully compliant with the NVMe ZNS spec except for persistent zone states. We can add persistence to the metadata of ZNS emulation by taking advantage of the new block layer APIs and using the Qcow2 image as the backing file. The metadata divides into two parts: the zone states and the ZDED (zone descriptor extension data), which is one of the zone attribute fields.
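
As an illustration of the idea rather than the exact code, persisting a single zone's write pointer after a write could look like the sketch below. The zonedmeta_offset field is an assumed name, and the exact bdrv_pwrite() signature has changed across QEMU versions.

/*
 * Sketch only: write one zone's write pointer back into the on-disk
 * metadata area of the Qcow2 image. The zonedmeta_offset field is an
 * assumed name, and the bdrv_pwrite() signature varies across QEMU
 * versions.
 */
static int qcow2_sketch_store_wp(BlockDriverState *bs, uint32_t idx)
{
    BDRVQcow2State *s = bs->opaque;
    uint64_t wp = cpu_to_le64(s->wp[idx].wp);
    /* Offset of this zone's entry inside the on-disk write pointer array. */
    int64_t offset = s->zoned_header.zonedmeta_offset +
                     (int64_t)idx * sizeof(uint64_t);

    return bdrv_pwrite(bs->file, offset, sizeof(wp), &wp, 0);
}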

Address translation

NVMe device internal memory can store data directly from the guest, depending on where the I/O queues live: in the host or in the NVMe device. Normally, a guest address is translated into a host address. With DMA (direct memory access), however, the guest cannot map the address, which triggers a bounce buffer; the host traps it and asks whether the NVMe device knows about that address. The CMB (controller memory buffer) and PMR (persistent memory region) are designed for mapping data in device memory. The PMR remains persistent even across crashes, so writes can be resumed from it. Only one bounce buffer in flight is trapped at a time. The helper below dispatches a zone append either through the DMA path or through an iovec-based path.

static inline void nvme_blk_zone_append(BlockBackend *blk, int64_t *offset,
                                        uint32_t align,
                                        BlockCompletionFunc *cb,
                                        NvmeZoneCmdAIOCB *aiocb)
{
    NvmeRequest *req = aiocb->req;
    assert(req->sg.flags & NVME_SG_ALLOC);

    if (req->sg.flags & NVME_SG_DMA) {
        req->aiocb = dma_blk_zone_append(blk, &req->sg.qsg, (int64_t)offset,
                                         align, cb, aiocb);
    } else {
        req->aiocb = blk_aio_zone_append(blk, offset, &req->sg.iov, 0,
                                         cb, aiocb);
    }
}

Zone attributes: ZRWA, ZDED

The ZRWA (zone random write area) is a bit intended for legacy adaptation: old drives that do not enforce the sequential write constraint set this bit to valid. ZRWAV is a runtime zone attribute. All writes to the ZRWA must be persistent, while the write pointer is not advanced immediately[3].

The size of the ZDED is relatively small compared to the overall size of the image file, so we adopt the option of storing the ZDED of all zones in an array, regardless of whether the valid bit is set.
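
Under that choice, locating a zone's ZDED slot is a simple offset calculation, sketched below with assumed field names (zonedmeta_offset, nr_zones, zded_size) and an assumed layout in which the ZDED array follows the write pointer array.

/*
 * Sketch only: each zone reserves a fixed-size ZDED slot whether or not
 * its valid bit is set, so the slot can be addressed directly. Field
 * names and the layout are assumptions for illustration.
 */
static int64_t qcow2_sketch_zded_offset(BDRVQcow2State *s, uint32_t idx)
{
    /* The ZDED area is assumed to follow the write pointer array. */
    int64_t zded_base = s->zoned_header.zonedmeta_offset +
                        (int64_t)s->zoned_header.nr_zones * sizeof(uint64_t);

    return zded_base + (int64_t)idx * s->zoned_header.zded_size;
}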

Finally, I would like to express my gratitude to Stefan Hajnoczi, Damien Le Moal, and Dmitry Fomichev for their guidance and support over the last year. I have gained far more than technical experience from the QEMU community.


  1. https://patchew.org/QEMU/20201208200410.27900-1-dmitry.fomichev@wdc.com/

  2. https://zonedstorage.io/docs/introduction/zoned-storage#zone-states-and-state-transitions

  3. Comments by Klaus Jensen.