8000
Skip to content

lxd/storage/drivers/zfs_volumes: Rework zvol resolution logic (from Incus)#17095

Merged
tomponline merged 10 commits intocanonical:mainfrom
simondeziel:zfs-cherry-picks
Dec 3, 2025
Merged

lxd/storage/drivers/zfs_volumes: Rework zvol resolution logic (from Incus)#17095
tomponline merged 10 commits intocanonical:mainfrom
simondeziel:zfs-cherry-picks

Conversation

@simondeziel
Copy link
Copy Markdown
Member

Cherry-picks from lxc/incus#2673

Copy link
Copy Markdown
Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reworks the ZFS zvol device resolution logic by replacing the previous approach that relied on udev symlinks and the zvol_id utility with a more direct method using the BLKZNAME ioctl. The changes were cherry-picked from Incus PR #2673.

Key Changes

  • Replaced udev symlink traversal with direct BLKZNAME ioctl queries to identify ZFS block devices
  • Removed dependency on the zvol_id utility binary
  • Reduced udev-related sleep delays from 1 second to 500ms in snapshot mounting path

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
lxd/storage/drivers/driver_zfs_volumes.go Core changes: new device resolution logic using BLKZNAME ioctl, sorting devices by creation time, and removed 1-second udev wait delay in volume activation
lxd/linux/ioctls.go Added BLKZNAME ioctl constant definition for querying ZFS dataset names from block devices
lxd/linux/cgo.go New file containing CGO build configuration consistent with other linux package files

Comment thread lxd/storage/drivers/driver_zfs_volumes.go Outdated
Comment thread lxd/storage/drivers/driver_zfs_volumes.go Outdated
Copy link
Copy Markdown
Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

Comment thread lxd/storage/drivers/driver_zfs_volumes.go
Comment thread lxd/storage/drivers/driver_zfs_volumes.go Outdated
Comment thread lxd/storage/drivers/driver_zfs_volumes.go Outdated
Comment thread lxd/storage/drivers/driver_zfs_volumes.go Outdated
Copy link
Copy Markdown
Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment thread lxd/storage/drivers/driver_zfs_volumes.go
@simondeziel simondeziel requested a review from mihalicyn December 1, 2025 15:38
@simondeziel simondeziel marked this pull request as ready for review December 1, 2025 15:38
@simondeziel simondeziel force-pushed the zfs-cherry-picks branch 2 times, most recently from 6e2b8a1 to 2806f3f Compare December 3, 2025 03:18
continue
aInfo, _ := a.Info()
if aInfo != nil {
stat, ok := aInfo.Sys().(*syscall.Stat_t)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this could be de-duplicated into a function for extracting creation time

Comment thread lxd/storage/drivers/driver_zfs_volumes.go Outdated
Copy link
Copy Markdown
Member
@tomponline tomponline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but please can we stick with using the more readable time.Time data type.

mihalicyn
mihalicyn previously approved these changes Dec 3, 2025
Copy link
Copy Markdown
Contributor
@mihalicyn mihalicyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Just for the history. We discussed a race-condition that still exists here, because in some scenarios, during ZFS volume activation we might have 2 iterations of device create/device remove in the kernel, we theoretically might end up having a wrong device path as a result of getVolumeDiskPathFromDataset().

Imagine the following picture (sorry, I used AI to draw it but scheme itself is mine):

┌──────────────────────────────┬───────────────────────────────┬────────────────────────────────────────┐
│ lxc launch ubuntu:j          │                               │ lxc launch ubuntu:n                    │
├──────────────────────────────┼───────────────────────────────┼────────────────────────────────────────┤
│ start clone                  │                               │ start clone                            │
├──────────────────────────────┼───────────────────────────────┼────────────────────────────────────────┤
│ zfs minor X created          │                               │                                        │
|         (from clone)         │                               │                                        │
├──────────────────────────────┼───────────────────────────────┼────────────────────────────────────────┤
│ minor X found by             │                               │                                        │
│ getVolumeDiskPathFromDataset │                               │                                        │
│ and returned                 │                               │                                        │
├──────────────────────────────┼───────────────────────────────┼────────────────────────────────────────┤
│ zfs minor X removed          │   RACE!   cloning finishes →  │ zfs minor X сreated (from clone)       │
│                              │                               │                                        │
│ zfs minor Y created          │                               │                                        │
│                              │                               │                                        │
│                              │                               │                                        │
├──────────────────────────────┼───────────────────────────────┼────────────────────────────────────────┤
│                              │                               │                                        │
│                              │                               │ zfs minor X removed                    │
│                              │                               │ zfs minor X created                    │
│                              │                               │ minor X found by                       │
│                              │                               │ getVolumeDiskPathFromDataset           │
│                              │                               │ and returned                           │
└──────────────────────────────┴───────────────────────────────┴────────────────────────────────────────┘

We end up with both containers using the same ZFS block device (with minor X).

@tomponline
Copy link
Copy Markdown
Member

LGTM.

Just for the history. We discussed a race-condition that still exists here, because in some scenarios, during ZFS volume activation we might have 2 iterations of device create/device remove in the kernel, we theoretically might end up having a wrong device path as a result of getVolumeDiskPathFromDataset().

Imagine the following picture (sorry, I used AI to draw it by scheme itself is mine):

┌──────────────────────────────┬───────────────────────────────┬────────────────────────────────────────┐
│ lxc launch ubuntu:j          │                               │ lxc launch ubuntu:n                    │
├──────────────────────────────┼───────────────────────────────┼────────────────────────────────────────┤
│ start clone                  │                               │ start clone                            │
├──────────────────────────────┼───────────────────────────────┼────────────────────────────────────────┤
│ zfs minor X created          │                               │                                        │
|         (from clone)         │                               │                                        │
├──────────────────────────────┼───────────────────────────────┼────────────────────────────────────────┤
│ minor X found by             │                               │                                        │
│ getVolumeDiskPathFromDataset │                               │                                        │
│ and returned                 │                               │                                        │
├──────────────────────────────┼───────────────────────────────┼────────────────────────────────────────┤
│ zfs minor X removed          │   RACE!   cloning finishes →  │ zfs minor X сreated (from clone)       │
│                              │                               │                                        │
│ zfs minor Y created          │                               │                                        │
│                              │                               │                                        │
│                              │                               │                                        │
├──────────────────────────────┼───────────────────────────────┼────────────────────────────────────────┤
│                              │                               │                                        │
│                              │                               │ zfs minor X removed                    │
│                              │                               │ zfs minor X created                    │
│                              │                               │ minor X found by                       │
│                              │                               │ getVolumeDiskPathFromDataset           │
│                              │                               │ and returned                           │
└──────────────────────────────┴───────────────────────────────┴────────────────────────────────────────┘

We end up with both containers using the same ZFS block device (with minor X).

Indeed, I think we should add a zfs activation lock similar to the lvmActivation in use here lxc/incus@77a5673

Signed-off-by: Stéphane Graber <stgraber@stgraber.org>
(cherry picked from commit 69387b236ef768e22fecdfe814e372d7f4fd5f38)
Signed-off-by: Simon Deziel <simon.deziel@canonical.com>
License: Apache-2.0
The previous logic was relying either on udev or on a slow crawl through
/dev using the zvol_id utility.

This was problematic not just because we could get into timeouts on
systems with a LOT of volumes, but also because it was inherently racy.
Udev works in the background, so entries may not show up immediately and
if something goes wrong during cleanup, we can be left with a symlink
pointing to the wrong /dev/zdXYZ device.

This is very similar to the LVM issue we fixed a little while back where
in the worst case scenario, it can lead to the wrong block device being
accessed, causing corruption or potential data leakage.

The new logic relies entirely on devtmpfs (not udev) and is optimized to
try and hit the correct /dev/zdXYZ device almost immediately, limiting
the performance impact on busy systems.

Closes lxc/incus#1505

Signed-off-by: Stéphane Graber <stgraber@stgraber.org>
(cherry picked from commit 3a26837acde2f2b8d0612b55780feab5a28ee098)
Signed-off-by: Simon Deziel <simon.deziel@canonical.com>
License: Apache-2.0
Signed-off-by: Simon Deziel <simon.deziel@canonical.com>
This reverts commit 1b3ba2d.

Signed-off-by: Simon Deziel <simon.deziel@canonical.com>
This reverts commit 7be37a1.

Signed-off-by: Simon Deziel <simon.deziel@canonical.com>
…t()`

Looking up the creation time of `/dev/zd*` devices as we walk through `/dev/zd*`
entries allows for a more efficient sorting by time and reduces the TOCTOU race
window.

Also save the change time to avoid the overhead of saving the full `os.DirEntry`.

Signed-off-by: Simon Deziel <simon.deziel@canonical.com>
…s are at the end

Signed-off-by: Simon Deziel <simon.deziel@canonical.com>
…y `/dev/zd*` files are used

Signed-off-by: Simon Deziel <simon.deziel@canonical.com>
Signed-off-by: Simon Deziel <simon.deziel@canonical.com>
Signed-off-by: Simon Deziel <simon.deziel@canonical.com>
Copy link
Copy Markdown
Member
@tomponline tomponline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@tomponline tomponline merged commit 0c989dd into canonical:main Dec 3, 2025
258 checks passed
@simondeziel simondeziel deleted the zfs-cherry-picks branch December 3, 2025 19:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants

0