Tracing down Bio in Block Subsystems

I wrote this post to answer two questions:

  1. What layers are involved in an I/O request, such as writing something to a file on a local machine?
  2. How does an md device (or any other block device, like the null block driver) receive its data?

Bottom halves

  1. Bottom halves perform interrupt-related work that was not performed by the interrupt handler (top half)
    • Run with all interrupts enabled
    • Deferring work means “not now”: run it later, in a safer context
  2. Work queue is a simple interface for deferring work to a generic kernel thread

runqueue & waitqueue

Interface

  1. queuing work to a workqueue[1]

queue_work
queue_work_on
queue_delayed_work
queue_delayed_work_on
  2. scheduling work on the shared system workqueue, as sketched below
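
A minimal sketch of the deferral pattern (my_work, my_work_fn, and my_irq_handler are hypothetical names, not from any real driver):

#include <linux/interrupt.h>
#include <linux/printk.h>
#include <linux/workqueue.h>

static void my_work_fn(struct work_struct *work)
{
    pr_info("deferred work running in process context\n");
}

static DECLARE_WORK(my_work, my_work_fn);

/* top half: defer the heavy lifting to process context */
static irqreturn_t my_irq_handler(int irq, void *dev)
{
    schedule_work(&my_work);    /* shared system workqueue */
    /* or: queue_work(my_wq, &my_work) with a dedicated
     * workqueue created by alloc_workqueue() */
    return IRQ_HANDLED;
}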

Block drivers

  1. No need to spawn another kernel thread when using workqueues
  2. A waitqueue waits on a loop until the condition is met (see the sketch after this list): https://stackoverflow.com/questions/11184581/why-does-wait-queue-implementation-wait-on-a-loop-until-condition-is-met
  3. A wakeup triggers a reschedule (possibly via an IPI to another CPU)
    • wake_up_interruptible wakes up only the processes that are in interruptible sleeps
  4. BIOs can be split and merged (chained); this happens in the scheduling layer.
  5. The null_blk driver is a bit different from others: it has two ways of receiving commands, bio-based and request-based.
  6. Device drivers are normally request-based. BIOs are already split/merged in the block layer (scheduling) and grouped into a request, which is sent to the device driver. The driver should not touch the BIOs inside a request/command; its job is to translate a request into the corresponding device command.
  7. In-flight BIOs counted in the device driver do not include the BIOs embedded in requests.
  8. Much of the Linux kernel runs in an asynchronous context.
  9. Flow control in device drivers may not be a good idea: many places in the block layer already do (or could do) this, such as the scheduling layer, where requests are regulated.
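
To illustrate points 2 and 3 above, a minimal wait/wake sketch (wq, cond, consumer, and producer are hypothetical names):

#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(wq);
static int cond;

/* consumer: wait_event_interruptible() sleeps, and re-checks the
 * condition each time the task is woken -- a wakeup alone does not
 * guarantee the condition holds */
static int consumer(void)
{
    return wait_event_interruptible(wq, READ_ONCE(cond) != 0);
}

/* producer: make the condition true, then wake interruptible sleepers */
static void producer(void)
{
    WRITE_ONCE(cond, 1);
    wake_up_interruptible(&wq);
}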

Block I/O

v6.3-rc2

high level: app -> fs -> block layer

application
VFS
File system (XFS, btrfs, etc.)
Page cache
Block layer
- Device mapper
Driver Level
- I/O scheduler
- Physical device driver

bio -> bio_vec/bi_sector -> memory page

struct bio {
    struct bio *bi_next;            /* request queue link */
    struct block_device *bi_bdev;
    blk_opf_t bi_opf;
    unsigned short bi_flags;        /* BIO_* below */
    unsigned short bi_ioprio;
    blk_status_t bi_status;
    atomic_t __bi_remaining;        /* usage counter */

    struct bvec_iter bi_iter;

    blk_qc_t bi_cookie;
    bio_end_io_t *bi_end_io;
    void *bi_private;
    ...

    atomic_t __bi_cnt;              /* pin count */
    struct bio_vec *bi_io_vec;      /* the actual vec list */
    struct bio_set *bi_pool;
    struct bio_vec bi_inline_vecs[];
};
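
For example, a bio-based driver typically walks a bio's data segments with the standard iterator macro (a sketch; walk_bio is a hypothetical helper):

#include <linux/bio.h>

static void walk_bio(struct bio *bio)
{
    struct bio_vec bvec;
    struct bvec_iter iter;

    bio_for_each_segment(bvec, bio, iter) {
        /* each bvec is one chunk of the transfer: bvec.bv_page,
         * bvec.bv_offset, bvec.bv_len; iter.bi_sector tracks the
         * current device sector */
    }
}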

gendisk -> request queue/block device -> request

submit_bio() -> submit_bio_noacct() -> submit_bio_noacct_nocheck() -> __submit_bio_noacct_mq()/__submit_bio_noacct() -> __submit_bio()

(in v5.8 and earlier this entry path was generic_make_request()[2])
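
A simplified sketch of how v6.3 dispatches at this point (based on block/blk-core.c; accounting hooks elided, so treat field and helper names as approximate):

void submit_bio_noacct_nocheck(struct bio *bio)
{
    /* ... cgroup/accounting hooks elided ... */
    if (current->bio_list)
        bio_list_add(&current->bio_list[0], bio);  /* avoid deep recursion */
    else if (!bio->bi_bdev->bd_has_submit_bio)
        __submit_bio_noacct_mq(bio);    /* request-based (blk-mq) path */
    else
        __submit_bio_noacct(bio);       /* bio-based driver (md, dm) */
}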

struct request_queue {
    struct request *last_merge;
    struct elevator_queue *elevator;

    struct percpu_ref q_usage_counter;
    struct blk_queue_stats *stats;
    struct rq_qos *rq_qos;
    const struct blk_mq_ops *mq_ops;
    struct blk_mq_ctx __percpu *queue_ctx;
    unsigned int queue_depth;
    void *queuedata;
    unsigned long queue_flags;
    ...
    spinlock_t queue_lock;
    struct gendisk *disk;
    unsigned long nr_requests;      /* Max # of requests */
    ...
};

struct request {
    struct request_queue *q;
    blk_opf_t cmd_flags;            /* op and common flags */
    req_flags_t rq_flags;
    ...

    /* the following two fields are internal, NEVER access directly */
    unsigned int __data_len;        /* total data len */
    sector_t __sector;              /* sector cursor */

    struct bio *bio;
    struct bio *biotail;

    union {
        struct list_head queuelist;
        struct request *rq_next;
    };

    struct block_device *part;
    ...
};

bio layer[3] -> request layer -> device driver[4]

request queue[5]

create/delete a request queue: blk_mq_init_queue()[6]

start processing a request: blk_mq_start_request()
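
A minimal, hypothetical queue_rq sketch tying these together (my_queue_rq and my_mq_ops are made-up names; real drivers usually complete requests asynchronously from an interrupt or completion path):

#include <linux/blk-mq.h>

static blk_status_t my_queue_rq(struct blk_mq_hw_ctx *hctx,
                                const struct blk_mq_queue_data *bd)
{
    struct request *rq = bd->rq;

    blk_mq_start_request(rq);       /* mark the request in flight */
    /* ... translate rq into a device command and issue it ... */
    blk_mq_end_request(rq, BLK_STS_OK); /* synchronous completion */
    return BLK_STS_OK;
}

static const struct blk_mq_ops my_mq_ops = {
    .queue_rq = my_queue_rq,
};

/* the queue itself comes from blk_mq_init_queue() on a tag set; see [6] */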

device mapper[7]

Flow control

Problem: implement flow control for BIOs in the md block device (or any other block device, like the null block driver) in the Linux kernel.

A bio is the unit that maps data in memory to a generic block offset. Block device drivers receive requests or I/Os from the I/O scheduler. The scheduler groups BIOs into requests and uses a multi-queue mechanism to dispatch events. The device mapper sits on top of the I/O scheduler: DM receives BIOs from the user/fs and remaps them first, according to its mapping properties.

First, look at the null_blk driver. An in-flight BIO is one that has been submitted but has not yet completed. The shared resource in this flow-control problem is the set of in-flight BIOs. The invariants of flow control for BIOs in the null_blk driver are:

  1. Increasing: if the number of in-flight BIOs >= high, the thread processing the bio is blocked
  2. Decreasing: if the number of in-flight BIOs drops below low, the blocked threads continue executing

Second, the null_blk driver receives either requests from the multi-queue block layer or BIOs directly. Both the request-based and the bio-based queue models handle BIOs through *_handle_cmd and complete them through *_complete_cmd. Zoned commands are just one part of I/O processing and can be treated the same way as the rest.
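
A heavily abridged sketch of those two entry points, assuming the shape of drivers/block/null_blk/main.c in recent kernels (bodies elided; exact signatures vary by kernel version):

/* request-based path: blk-mq hands the driver a struct request */
static blk_status_t null_queue_rq(struct blk_mq_hw_ctx *hctx,
                                  const struct blk_mq_queue_data *bd)
{
    struct nullb_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);

    /* ... setup elided ... */
    return null_handle_cmd(cmd, blk_rq_pos(bd->rq),
                           blk_rq_sectors(bd->rq), req_op(bd->rq));
}

/* bio-based path: the driver's ->submit_bio() sees raw BIOs */
static void null_submit_bio(struct bio *bio)
{
    struct nullb_cmd *cmd;

    /* ... cmd allocation elided ... */
    null_handle_cmd(cmd, bio->bi_iter.bi_sector,
                    bio_sectors(bio), bio_op(bio));
}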

Intuitively, there are a few ways to add this locking to the null_blk driver:

  1. A bare spinlock, when the expected blocking time is smaller than the cost of a context switch
  2. A spinlock plus a semaphore with sleep/wakeup
  3. A sleeping lock plus atomics

This patch applies method 2. It can be implemented with atomic ops, a waitqueue, and a spinlock.

Waitqueues have disadvantages in interrupt context. A workqueue is better for handling such tasks without spawning another kernel thread.

Finally, the cases that increase the in-flight BIO count are handled in two parts. Let n be the number of BIOs in a request; the submitter must block when:

  1. n = 1 and bio_in_flight > high
  2. n + bio_in_flight > high
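
A sketch of method 2 under the invariants above; every name here (flow_get, flow_put, bio_in_flight, BIO_HIGH, BIO_LOW) is hypothetical and this is not the actual patch:

#include <linux/spinlock.h>
#include <linux/wait.h>

#define BIO_HIGH 64    /* hypothetical high watermark */
#define BIO_LOW  32    /* hypothetical low watermark */

static DEFINE_SPINLOCK(flow_lock);
static DECLARE_WAIT_QUEUE_HEAD(flow_wq);
static unsigned int bio_in_flight;

/* called before issuing n BIOs (n = 1 on the bio path, n = the
 * number of BIOs in a request on the mq path) */
static void flow_get(unsigned int n)
{
    spin_lock(&flow_lock);
    while (bio_in_flight + n > BIO_HIGH) {
        spin_unlock(&flow_lock);
        wait_event(flow_wq, READ_ONCE(bio_in_flight) + n <= BIO_HIGH);
        spin_lock(&flow_lock);
    }
    bio_in_flight += n;
    spin_unlock(&flow_lock);
}

/* called from the completion path: drop the counter and wake
 * waiters once we fall below the low watermark */
static void flow_put(unsigned int n)
{
    unsigned int cur;

    spin_lock(&flow_lock);
    bio_in_flight -= n;
    cur = bio_in_flight;
    spin_unlock(&flow_lock);

    if (cur < BIO_LOW)
        wake_up(&flow_wq);
}

flow_get() re-checks the watermark under the spinlock after each wakeup, keeping the check and the increment atomic; flow_put() wakes sleepers only once the count drops below the low watermark, which gives the hysteresis described by the two invariants.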

  1. https://embetronicx.com/tutorials/linux/device-drivers/work-queue-in-linux-own-workqueue/

  2. v4.5 block layer

  3. http://books.gigatux.nl/mirror/kerneldevelopment/0672327201/ch13lev1sec3.html

  4. http://blog.vmsplice.net/2020/04/how-linux-vfs-block-layer-and-device.html

  5. https://linux-kernel-labs.github.io/refs/heads/master/labs/block_device_drivers.html#request-queues-multi-queue-block-layer

  6. https://linux-kernel-labs.github.io/refs/heads/master/labs/block_device_drivers.html#create-and-delete-a-request-queue

  7. https://xuechendi.github.io/2013/11/14/device-mapper-deep-dive