lxd/storage/drivers/zfs_volumes: Rework zvol resolution logic (from Incus)#17095
lxd/storage/drivers/zfs_volumes: Rework zvol resolution logic (from Incus)#17095tomponline merged 10 commits intocanonical:mainfrom
Conversation
c2b09ae to
59b22a6
Compare
There was a problem hiding this comment.
Pull request overview
This PR reworks the ZFS zvol device resolution logic by replacing the previous approach that relied on udev symlinks and the zvol_id utility with a more direct method using the BLKZNAME ioctl. The changes were cherry-picked from Incus PR #2673.
Key Changes
- Replaced udev symlink traversal with direct BLKZNAME ioctl queries to identify ZFS block devices
- Removed dependency on the
zvol_idutility binary - Reduced udev-related sleep delays from 1 second to 500ms in snapshot mounting path
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| lxd/storage/drivers/driver_zfs_volumes.go | Core changes: new device resolution logic using BLKZNAME ioctl, sorting devices by creation time, and removed 1-second udev wait delay in volume activation |
| lxd/linux/ioctls.go | Added BLKZNAME ioctl constant definition for querying ZFS dataset names from block devices |
| lxd/linux/cgo.go | New file containing CGO build configuration consistent with other linux package files |
60418ce to
cb2d194
Compare
c445d10 to
411de36
Compare
411de36 to
6c10172
Compare
6e2b8a1 to
2806f3f
Compare
| continue | ||
| aInfo, _ := a.Info() | ||
| if aInfo != nil { | ||
| stat, ok := aInfo.Sys().(*syscall.Stat_t) |
There was a problem hiding this comment.
this could be de-duplicated into a function for extracting creation time
There was a problem hiding this comment.
LGTM, but please can we stick with using the more readable time.Time data type.
There was a problem hiding this comment.
LGTM.
Just for the history. We discussed a race-condition that still exists here, because in some scenarios, during ZFS volume activation we might have 2 iterations of device create/device remove in the kernel, we theoretically might end up having a wrong device path as a result of getVolumeDiskPathFromDataset().
Imagine the following picture (sorry, I used AI to draw it but scheme itself is mine):
┌──────────────────────────────┬───────────────────────────────┬────────────────────────────────────────┐
│ lxc launch ubuntu:j │ │ lxc launch ubuntu:n │
├──────────────────────────────┼───────────────────────────────┼────────────────────────────────────────┤
│ start clone │ │ start clone │
├──────────────────────────────┼───────────────────────────────┼────────────────────────────────────────┤
│ zfs minor X created │ │ │
| (from clone) │ │ │
├──────────────────────────────┼───────────────────────────────┼────────────────────────────────────────┤
│ minor X found by │ │ │
│ getVolumeDiskPathFromDataset │ │ │
│ and returned │ │ │
├──────────────────────────────┼───────────────────────────────┼────────────────────────────────────────┤
│ zfs minor X removed │ RACE! cloning finishes → │ zfs minor X сreated (from clone) │
│ │ │ │
│ zfs minor Y created │ │ │
│ │ │ │
│ │ │ │
├──────────────────────────────┼───────────────────────────────┼────────────────────────────────────────┤
│ │ │ │
│ │ │ zfs minor X removed │
│ │ │ zfs minor X created │
│ │ │ minor X found by │
│ │ │ getVolumeDiskPathFromDataset │
│ │ │ and returned │
└──────────────────────────────┴───────────────────────────────┴────────────────────────────────────────┘
We end up with both containers using the same ZFS block device (with minor X).
Indeed, I think we should add a zfs activation lock similar to the |
Signed-off-by: Stéphane Graber <stgraber@stgraber.org> (cherry picked from commit 69387b236ef768e22fecdfe814e372d7f4fd5f38) Signed-off-by: Simon Deziel <simon.deziel@canonical.com> License: Apache-2.0
The previous logic was relying either on udev or on a slow crawl through /dev using the zvol_id utility. This was problematic not just because we could get into timeouts on systems with a LOT of volumes, but also because it was inherently racy. Udev works in the background, so entries may not show up immediately and if something goes wrong during cleanup, we can be left with a symlink pointing to the wrong /dev/zdXYZ device. This is very similar to the LVM issue we fixed a little while back where in the worst case scenario, it can lead to the wrong block device being accessed, causing corruption or potential data leakage. The new logic relies entirely on devtmpfs (not udev) and is optimized to try and hit the correct /dev/zdXYZ device almost immediately, limiting the performance impact on busy systems. Closes lxc/incus#1505 Signed-off-by: Stéphane Graber <stgraber@stgraber.org> (cherry picked from commit 3a26837acde2f2b8d0612b55780feab5a28ee098) Signed-off-by: Simon Deziel <simon.deziel@canonical.com> License: Apache-2.0
Signed-off-by: Simon Deziel <simon.deziel@canonical.com>
This reverts commit 1b3ba2d. Signed-off-by: Simon Deziel <simon.deziel@canonical.com>
This reverts commit 7be37a1. Signed-off-by: Simon Deziel <simon.deziel@canonical.com>
…t()` Looking up the creation time of `/dev/zd*` devices as we walk through `/dev/zd*` entries allows for a more efficient sorting by time and reduces the TOCTOU race window. Also save the change time to avoid the overhead of saving the full `os.DirEntry`. Signed-off-by: Simon Deziel <simon.deziel@canonical.com>
…s are at the end Signed-off-by: Simon Deziel <simon.deziel@canonical.com>
…y `/dev/zd*` files are used Signed-off-by: Simon Deziel <simon.deziel@canonical.com>
Signed-off-by: Simon Deziel <simon.deziel@canonical.com>
Signed-off-by: Simon Deziel <simon.deziel@canonical.com>
2806f3f to
bcffb91
Compare
Cherry-picks from lxc/incus#2673