
feat(control): improve DNS fallback reliability and harden connection lifecycle #936

Open
olicesx wants to merge 525 commits into daeuniverse:main from olicesx:optimize/code-quality-fixes
Conversation

@olicesx
Contributor
@olicesx olicesx commented Feb 15, 2026

Background

This PR improves control-plane robustness with a focus on DNS forwarding reliability and concurrency safety.
Main updates include:

Improve tcp+udp DNS upstream behavior with robust UDP-first and TCP fallback handling.
Feed DNS forward failures into dialer health feedback so failover decisions can react faster.
Harden DNS/UDP connection lifecycle under high concurrency.
Add regression tests for fallback, timeout cleanup, and pool safety.

Checklist

Full Changelogs

feat(dns): add robust DNS forward fallback path for tcp+udp upstream (UDP-first with TCP fallback on request failure).
fix(dns): report DNS forward failures to dialer health feedback path to improve failover quality.
fix(control): harden DNS/UDP connection lifecycle handling in high-concurrency paths.
test(control): add regression tests for DNS fallback, timeout cleanup, and pool concurrency safety.

Issue Reference

Closes #[issue number]

Test Result

Environment: Linux

Passed:

go test ./common/consts ./component/sniffing ./control
go test -race ./control
go build .

Copilot AI review requested due to automatic review settings February 15, 2026 07:30
@olicesx olicesx requested review from a team as code owners February 15, 2026 07:30
Copilot AI left a comment


Pull request overview

This PR hardens the control-plane DNS forwarding and connection lifecycle under concurrency, adding a UDP-first/TCP-fallback path for tcp+udp DNS upstreams, health feedback on forward failures, and multiple regression tests to validate fallback, cleanup, and pool safety.

Changes:

  • Add robust DNS forwarding behavior: UDP-first with TCP fallback for tcp+udp upstreams, plus dialer health feedback on forward failures and concurrency limiting/singleflight coalescing.
  • Improve UDP/TCP connection lifecycle and concurrency safety (UDP task pool GC, DNS connection pooling/pipelining, routing-result caching for UDP endpoints).
  • Add/expand regression tests and update changelog entries.

Reviewed changes

Copilot reviewed 32 out of 32 changed files in this pull request and generated 2 comments.

Show a summary per file
| File | Description |
|---|---|
| control/utils.go | Adjust routing matcher call signature and MAC formatting to fixed-size 16-byte inputs. |
| control/udp_task_pool_test.go | Replace prior timing-based test with deterministic concurrency/order regression tests. |
| control/udp_task_pool.go | Rework UDP task queue lifecycle/GC for concurrency safety and predictable idle cleanup. |
| control/udp_routing_cache_test.go | Add tests for UDP endpoint routing-result cache hit/expire behavior. |
| control/udp_endpoint_pool.go | Add per-endpoint routing-result cache with TTL and concurrency protection. |
| control/udp.go | Add fast QUIC prefilter to avoid expensive sniffing; handle DNS concurrency-limit refusal explicitly. |
| control/tcp_test.go | Add regression test ensuring RelayTCP cancellation unblocks the opposite direction. |
| control/tcp.go | Use context-driven cancellation to interrupt the other copy direction promptly; switch to consts IPPROTO. |
| control/routing_matcher_userspace.go | Change matcher inputs to fixed-size [16]uint8 and reduce per-call allocations. |
| control/packet_sniffer_pool_test.go | Make packet sniffer tests isolate global pool state and remove sessions explicitly. |
| control/dns_pipelining_bench_test.go | Add benchmarks around pipelined DNS conn, ID allocation, and contention patterns. |
| control/dns_pipelined_conn_test.go | Add tests ensuring pipelined conn cleanup on success/timeout and input ID preservation. |
| control/dns_listener.go | Fix TCP listener start logic and ignore concurrency-limit errors (REFUSED already written). |
| control/dns_id_bitmap_test.go | Add tests for concurrent ID allocation uniqueness and reuse behavior. |
| control/dns_fallback_test.go | Add tests for UDP→TCP fallback and for avoiding dialer poisoning on canceled contexts. |
| control/dns_control.go | Major DNS controller hardening: concurrency limiter, singleflight coalescing, sync.Map caches, forwarder caching, fallback routing, and close lifecycle. |
| control/dns_conn_pool_test.go | Add tests for UDP conn pool close/put races, conn pool dial contention, and responseSlot reuse. |
| control/dns_concurrency_test.go | Add regression test validating concurrency limiter rejection behavior. |
| control/dns_cache.go | Replace reflection-based deep copy with explicit RR copying and add DnsCache.Clone. |
| control/dns.go | Implement pooled/pipelined DNS transports, TCP/TLS connection pooling, UDP conn pool, and improved cleanup paths. |
| control/control_plane_core_test.go | Add race regression test validating atomic core flip behavior. |
| control/control_plane_core.go | Make core flip atomic/CAS-based and apply best-effort qdisc/filter cleanup comments. |
| control/control_plane.go | Wire DNS controller Close into lifecycle, switch to sync.Map cloning, and add UDP routing-result caching in receive path. |
| control/connectivity.go | Use local consts IP protocol numbers instead of unix constants. |
| control/bpf_utils.go | Add documentation notes for auto-generated BPF types; improve batch map error wrapping. |
| control/anyfrom_pool.go | Gate GSO detection behind env var to keep GSO disabled by default but optionally testable. |
| component/sniffing/sniffer.go | Add cheap QUIC header precheck to avoid costly sniffing when not applicable. |
| component/sniffing/quic_test.go | Add unit tests for the QUIC initial-packet precheck. |
| component/sniffing/quic.go | Add IsLikelyQuicInitialPacket helper for fast-path filtering. |
| common/consts/dialer_test.go | Add tests for protocol conversion helpers. |
| common/consts/dialer.go | Replace unix protocol constants with local IANA protocol numbers; add doc comments. |
| CHANGELOGS.md | Add Unreleased section documenting DNS/control robustness improvements and tests. |


Comment thread control/dns.go Outdated
if slot == nil {
continue
}
slot.set(&msg)
Copilot AI Feb 15, 2026


In pipelinedConn.readLoop, slot.set(&msg) passes a pointer to the loop-local msg variable. Because multiple in-flight requests can be pending concurrently, this can lead to responses sharing/overwriting the same underlying dnsmessage.Msg storage (and potential data races / wrong answers delivered). Allocate a distinct message per response before sending it to the slot (e.g., create a new dnsmessage.Msg and copy msg into it, or copy into a new variable whose address won’t be reused across iterations).

Suggested change
slot.set(&msg)
// Allocate a distinct message instance per response to avoid
// sharing the loop-local msg between concurrent requests.
resp := new(dnsmessage.Msg)
*resp = msg
slot.set(resp)

Comment thread control/dns_cache.go
Comment on lines 22 to +28
 func (c *DnsCache) FillInto(req *dnsmessage.Msg) {
-	req.Answer = deepcopy.Copy(c.Answer).([]dnsmessage.RR)
+	if c.Answer != nil {
+		req.Answer = make([]dnsmessage.RR, len(c.Answer))
+		for i, rr := range c.Answer {
+			req.Answer[i] = dnsmessage.Copy(rr)
+		}
+	}
Copilot AI Feb 15, 2026


DnsCache.FillInto only assigns req.Answer when c.Answer != nil. If the cache entry has a nil/empty Answer, this leaves any existing req.Answer intact, which can accidentally leak stale answers if the dnsmessage.Msg is reused. Always set req.Answer explicitly (set to nil/empty when c.Answer is nil) before setting the other response fields.
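The always-assign pattern the review asks for can be sketched as follows. The toy `RR`/`Msg` types here stand in for the real `dnsmessage` types so the pattern is runnable in isolation; this is illustrative, not the PR's code.

```go
package main

import "fmt"

// Toy stand-ins for dnsmessage.Msg and dnsmessage.RR.
type RR string
type Msg struct{ Answer []RR }
type DnsCache struct{ Answer []RR }

// FillInto assigns req.Answer unconditionally: when the cache entry has
// no Answer section, any stale answers in a reused Msg are cleared
// rather than silently left in place.
func (c *DnsCache) FillInto(req *Msg) {
	if c.Answer == nil {
		req.Answer = nil // clear stale answers explicitly
		return
	}
	req.Answer = make([]RR, len(c.Answer))
	copy(req.Answer, c.Answer)
}

func main() {
	msg := &Msg{Answer: []RR{"stale A record"}}
	(&DnsCache{}).FillInto(msg) // cache entry with nil Answer
	fmt.Println(msg.Answer == nil) // true: stale answer was cleared
}
```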

@olicesx olicesx requested a review from a team as a code owner February 16, 2026 15:20
MarksonHon
MarksonHon previously approved these changes Feb 17, 2026
@MaurUppi

Tested 5000 QPS with dnsperf: 10 clients, 2 worker threads, stats output interval 1s.

dnsperf -s 192.168.1.15 \
        -d targets/queries_cache_heavy.txt \
        -c 10 \
        -T 2 \
        -l 60 \
        -S 1 \
        -Q 5000

I also counted the "broken pipe" log entries since startup; compared to v1.0.0 they dropped sharply, from several thousand to single digits.
journalctl -u dae --since "2026-02-21 19:19:47" --no-pager | grep -c "broken pipe"

CleanShot 2026-02-22 at 11 45 41

Everything feels normal in use.

@olicesx olicesx force-pushed the optimize/code-quality-fixes branch 2 times, most recently from 6ecfee3 to 348514f Compare February 22, 2026 20:57
@MaurUppi

If the following doesn't belong here, please tell me and I'll delete it.

feat(dns): optimize DNS caching with async write and lock-free upstream resolver 594f449

This improvement is impressive!

| Dimension | Benchmark | Metric | 7c2f407 -> 594f449 | origin/main (5268be5) -> 594f449 | Conclusion |
|---|---|---|---|---|---|
| Upstream hot path | BenchmarkUpstreamResolver_GetUpstream_Serial | sec/op | 5.400ns -> 2.218ns (-58.9%) | 5.444ns -> 2.219ns (-59.2%) | Clearly faster |
| Upstream hot path | BenchmarkUpstreamResolver_GetUpstream_Parallel | sec/op | 43.590ns -> 3.355ns (-92.3%) | 48.750ns -> 3.878ns (-92.0%) | Large gain on the concurrent path |
| Upstream summary | geomean of the two above | sec/op | -82.22% | -81.99% | Consistent trend, stable gain |
| DNS control-plane common path | PipelinedConn_Sequential | sec/op | 2.964us -> 2.975us (+0.4%) | 3.056us -> 3.008us (-1.6%) | Essentially flat |
| DNS control-plane common path | PipelinedConn_Concurrent | sec/op | 3.326us -> 3.254us (-2.2%) | 3.450us -> 3.446us (-0.1%) | Essentially flat |
| DNS control-plane common path | DnsController_Singleflight | sec/op | 0.04573n -> 0.04183n | 0.03788n -> 0.04818n | Noisy microbenchmark, low reference value |
| Control-plane summary | geomean of the 3 above | sec/op | -3.51% | +7.74% | Noise/context-sensitive; not a standalone conclusion |

| Dimension | Benchmark | Result (head-only) | Notes |
|---|---|---|---|
| Async cache gain | BenchmarkAsyncCacheVsSyncCache | Async ~723-923ns/op vs Sync ~1.53-1.64ms/op | roughly 1700x~2200x (same machine, same run) |
| Cache-stampede protection | BenchmarkAsyncCacheWithSingleflight | upstream_calls/op ≈ 0.9~1.0 (1~1000 concurrency) | singleflight is effective |
| High-concurrency dedup | BenchmarkHighQpsScenario | deduplication_rate_% = 100% | dedup behaves as expected |

@MaurUppi

@olicesx
In my environment there aren't many client devices, yet short-lived (QUIC) entries reached 300 within 13 minutes, which is far too noisy at debug level. So I patched observability on top of your code (log the first occurrence, then one entry every 300); in the stock version the "short-lived UDP fast path fallback" lines and their dst/err/src pairs are not filter-friendly.

By the way, this PR's commit count (100+) is getting huge. When do you expect it to be finished?

02-26 20:19:55 level=debug msg="UDP routing tuple missing; short-lived UDP fast path fallback (Total=7200)" dst="192.168.1.15:53"
02-26 20:19:15 level=debug msg="UDP routing tuple missing; short-lived UDP fast path fallback" dst="198.18.0.2:53" error="reading map: key [192.168.1.171:57322, 17, 198.18.0.2:53]: lookup: key does not exist" src="192.168.1.171:57322"
02-26 20:06:49 level=debug msg="UDP routing tuple missing; short-lived UDP fast path fallback (Total=6900)" dst="192.168.1.15:53"

@olicesx
Contributor Author
olicesx commented Feb 26, 2026

@MaurUppi Debug logs are there for debugging, and this PR doesn't change much around logging; let's revisit that later.

@MaurUppi

@MaurUppi Debug logs are there for debugging, and this PR doesn't change much around logging; let's revisit that later.

The debug-level noise is just my personal preference. But if logging is reworked later, I'd still like the output to be structured and easy to filter and query.
Looking forward to PR 936 being merged soon.

@MaurUppi

@olicesx
For reference:

udp-taskpool-cpu-regression summary

Confirmed

  • 705/1077 goroutines are convoy, at 34.4% CPU (cumulative), consistent with the doc
  • tryDeleteQueue's LoadAndDelete + pointer comparison is indeed not an atomic CAS; it can delete the new queue and leak goroutines permanently
  • Commit attribution is correct: a795323 introduced the defect, 41d20f2 triggered the amplification
  • selectgo's 80ms flat is the largest single flat cost: scheduler churn from hundreds of convoys being scheduled every 100ms

New findings (not covered in the original doc)

  • The p.queues.Delete(key) at acquireQueue line 246 has the same kind of race: when two goroutines call acquireQueue concurrently and both see draining=true, the later one's Delete can mistakenly remove the new queue. It also needs to become
    CompareAndDelete
  • Leaked convoy goroutines can never exit (key no longer in the map → tryDeleteQueue always returns false → infinite loop); the 705 are roughly 17 hours of leaks accumulated since restart

Fix recommendations

  1. Must do: tryDeleteQueue → sync.Map.CompareAndDelete (Go 1.20+; the project uses 1.26)
  2. Must do: the Delete in acquireQueue's draining path should also become CompareAndDelete
  3. Suggested: tighten QUIC detection (add version-field validation) to cut the 6.25% false-hit rate
  4. Suggested: evaluate whether the 100ms agingTime is too aggressive

2026-02-27-udp-taskpool-cpu-regression-analysis

# dae CPU regression investigation notes (2026-02-27)

1. Background and symptoms

  • After deploying unstable-20260226.pr-26.r44.0da556 at 2026-02-26 16:52:43, CPU usage rose from the usual 1%~3% to roughly 6%~10%.
  • Perceived workload did not noticeably increase; this is a post-release regression.
  • The current pprof port is :5556 (endpoint_listen_address), not pprof_port bound directly to :6060.

2. Key evidence (this investigation)

2.1 Live metrics

  • go_goroutines: 1035~1084
  • process_cpu_seconds_total over a 30-second window estimates CPU at about 9.47%
  • 15-minute log counts (sample window):
    • dns_udp4_info=936
    • dns_tcp4_info=97
    • udp_tuple_miss=5
    • Other noisy debug/warn sources (e.g. rewrite/conn_check/udp_readloop) are near zero in this window.

2.2 CPU pprof (2026-02-27 09:42:34)

  • github.com/daeuniverse/dae/control.(*UdpTaskQueue).convoy cumulative about 31.71%
  • Pronounced runtime scheduling hotspots:
    • runtime.schedule/findRunnable/selectgo rank high.
  • convoy line-level hotspots concentrate at:
    • control/udp_task_pool.go:151 (the select loop)
    • control/udp_task_pool.go:167 (time.Sleep(10ms))
    • control/udp_task_pool.go:177 (tryDeleteQueue)

2.3 goroutine pprof

  • About 670 (*UdpTaskQueue).convoy goroutines, the bulk of the total.
  • Many stacks in the goroutine snapshot are parked at control/udp_task_pool.go:151, consistent with the CPU scheduling hotspots.

3. Root-cause analysis

3.1 Direct defect (root cause)

  • The current tryDeleteQueue implementation in control/udp_task_pool.go:
    • LoadAndDelete(key), then compare against the expected pointer.
    • Introduced by a795323 (refactor(control): optimize memory alignment and improve task queue management, 2026-02-17).
  • Problems:
    • The logic is not an atomic "delete if value matches".
    • Under concurrency it can delete the new queue mapping, then fail the comparison and return false, but the deletion has already happened.
    • The result is a corrupted queue map: both old and new convoys can enter long-lived idle spin loops, keeping goroutine and scheduling overhead persistently elevated.

3.2 Amplifying factors (what triggered this regression)

  • 41d20f2 (refactor(control): optimize UDP handling for QUIC packets and enhance task pool management, 2026-02-25) introduced two changes that significantly amplify the defect:
    • UdpTaskPool is now used whenever IsLikelyQuicInitialPacket matches;
    • agingTime changed from DefaultNatTimeout (historically 30s) to 100ms.
  • IsLikelyQuicInitialPacket inspects only bits of the first byte (component/sniffing/quic.go), a theoretical match space of about 16/256 = 6.25%, so false hits are fairly likely and drive more queue create/reclaim attempts.
  • The 100ms aging cycle sends convoys into the cleanup branch far more often, further amplifying the scheduling cost of the delete race.

4. Commit attribution

  • Commit that introduced the defect: a795323 (2026-02-17)
    • Evidence: git blame on control/udp_task_pool.go:263-270 points to a795323, which first introduced the LoadAndDelete + pointer-compare scheme.
  • Commit that triggered/amplified the regression: 41d20f2 (2026-02-25)
    • Evidence: git blame on control/udp_task_pool.go:20-25 points to 41d20f2, which introduced UdpTaskPoolAgingTime = 100ms.
    • The same commit rerouted some control_plane UDP paths back through UdpTaskPool.

5. Proposed fixes

5.1 Required fixes (do these first)

  • Give tryDeleteQueue truly atomic semantics:
    • Use sync.Map.CompareAndDelete(key, expected) (the Go version requirement is met);
    • No "delete first, compare afterwards".
  • Goal: eliminate the queue-map corruption and permanently idling convoys caused by mistakenly deleting the new mapping.

5.2 Suggested fixes (next step)

  • Tighten the IsLikelyQuicInitialPacket fast check to reduce false hits:
    • Add stronger header constraints (beyond the first byte's bits).
  • Evaluate whether agingTime=100ms is too aggressive:
    • It could fall back to a more conservative value (on the order of seconds) or an adaptive strategy.

5.3 Observability additions

  • Add udp_task_pool metrics:
    • current queue count, creations, successful deletions, failed deletions, active convoys.
  • Useful for comparing releases and for alerting.

6. Post-fix acceptance criteria

  • go_goroutines and the (*UdpTaskQueue).convoy count drop markedly and no longer hold at hundreds of idle spinners.
  • CPU returns close to the historical baseline (roughly the 1%~3% range under comparable load).
  • The pprof share of UdpTaskQueue.convoy and the scheduler hotspots drops significantly.

Opus-reviewed

# Opus Review: UDP TaskPool CPU Regression Analysis

Reviewed: 2026-02-27
Original doc: .plan/2026-02-27-udp-taskpool-cpu-regression-analysis.md
Status: Analysis largely correct; additional race conditions and nuances identified below.


1. Verification of Live Evidence

All key claims independently confirmed via live pprof (captured 2026-02-27 ~10:00 CST):

| Metric | Documented | Live Measurement |
|---|---|---|
| Total goroutines | ~1035-1084 | 1077 |
| convoy goroutines | ~670 | 705 (607 in select state) |
| convoy CPU (cum) | ~31.71% | 34.41% (320ms/930ms) |
| Line 151 select hotspot | dominant | 160ms (17.2% of total) |
| Line 167 time.Sleep(10ms) | visible | 40ms (4.3%) |
| Line 182 safeTimerReset | visible | 70ms (7.5%) |
| selectgo runtime overhead | high | 80ms flat, 180ms cum (19.4%) |

Key observations:

  • 705 convoy goroutines are in select state without explicit wait duration — meaning they are being scheduled frequently (every ~100ms timer tick), not sleeping.
  • runtime.selectgo at 80ms flat is the single largest flat-time consumer — direct evidence of scheduler thrashing from hundreds of convoy goroutines waking on timer.
  • runtime.schedule/findRunnable at 370-380ms cumulative confirms scheduler overhead from excessive runnable goroutines.

2. Root Cause Validation

2.1 tryDeleteQueue Race — CONFIRMED, with additional detail

The documented race is correct. Here is the precise sequence:

Timeline for key K:

1. Old convoy (Q1): timer fires → refs==0, ch empty
   → sets draining=true → sleeps 10ms → final check passes
   → calls tryDeleteQueue(K, Q1)

2. Meanwhile, acquireQueue(K) runs:
   → Load(K) returns Q1 → sees draining==true → goto createNew
   → LoadOrStore(K, Q2) → succeeds (stores Q2, returns loaded=false)
   → starts new convoy goroutine for Q2

3. tryDeleteQueue(K, Q1) executes:
   → LoadAndDelete(K) → removes Q2 from map, returns Q2
   → Q2 != Q1 → returns false

Result:
- Q2 is removed from the map, but Q2's convoy goroutine is running
- Q2's convoy can NEVER exit: tryDeleteQueue always returns false
  (key no longer in map → loaded=false → returns false)
- Q2 loops forever: timer fires (100ms) → draining → sleep(10ms) →
  tryDeleteQueue fails → reset timer → repeat

This is a goroutine leak with O(n) accumulation — each false-positive QUIC match for a unique AddrPort can produce an orphaned convoy.

2.2 Additional Race in acquireQueue Draining Path — NOT documented

The original analysis missed a second race at udp_task_pool.go:246:

if q.draining.Load() {
    p.queues.Delete(key)    // ← non-atomic delete of whatever is currently at key
    goto createNew
}

Scenario: Two goroutines (A, B) both call acquireQueue(K) and both see draining=true:

1. A: LoadOrStore(K) → loaded=true, returns Q1 (draining)
2. B: LoadOrStore(K) → loaded=true, returns Q1 (draining)
3. A: Delete(K)     — removes Q1
4. C: LoadOrStore(K, Q3) → stored! Q3 is now in map, C starts convoy for Q3
5. B: Delete(K)     — removes Q3 (intended to remove Q1!)
6. B: goto createNew → LoadOrStore(K, Q4) → stored, starts convoy for Q4

Result: Q3's convoy is orphaned (same mechanism as 2.1)

This is a lower-probability race (requires two concurrent acquireQueue calls for same key during draining window), but it compounds the primary issue.

Fix: Use sync.Map.CompareAndDelete(key, q) here too — only delete if the value is still the draining queue we observed.

2.3 IsLikelyQuicInitialPacket False Positive Rate — Analysis CORRECT but understated

The documented rate of 6.25% (16/256) is mathematically correct:

  • Required: bit7=1 (long header), bit6=1 (fixed bit), bits5:4=00 (Initial type)
  • Matching first bytes: 0xC0-0xCF

What the analysis understates: this affects ALL UDP traffic, not just QUIC. DNS responses, game packets, VoIP, etc. all flow through this check. With dns_udp4_info=936 in 15 minutes (~1/sec), even DNS alone produces ~4 false positives per minute. Each creates a queue+convoy that should age out in 100ms — but due to the tryDeleteQueue race, a fraction become permanent orphans.

The 705 orphaned convoys represent the cumulative leak since the last restart (~17 hours based on goroutine 19 age of 1028 minutes).

2.4 agingTime=100ms Impact — Analysis CORRECT

The 100ms aging time amplifies the race in two ways:

  1. More frequent timer firings → more opportunities to hit the tryDeleteQueue race
  2. The 10ms sleep in the draining path is 10% of the aging time, creating a larger window for concurrent acquireQueue calls

3. Commit Attribution — CONFIRMED

| Commit | Role | Evidence |
|---|---|---|
| a795323 (2026-02-17) | Introduced defect: LoadAndDelete + pointer compare pattern | git blame on tryDeleteQueue |
| 41d20f2 (2026-02-25) | Triggered regression: 100ms aging + broader QUIC path usage | git blame on UdpTaskPoolAgingTime |

4. Fix Recommendations — Review & Amendments

4.1 MUST FIX: tryDeleteQueue → CompareAndDelete ✅ Agree

func (p *UdpTaskPool) tryDeleteQueue(key netip.AddrPort, expected *UdpTaskQueue) bool {
    return p.queues.CompareAndDelete(key, expected)
}

sync.Map.CompareAndDelete is available since Go 1.20; project uses Go 1.26. Pointer comparison via any interface equality works correctly.

4.2 MUST FIX (MISSING): acquireQueue draining path

The p.queues.Delete(key) at line 246 should also use CompareAndDelete:

if q.draining.Load() {
    p.queues.CompareAndDelete(key, q)  // only delete if still the draining queue
    goto createNew
}

4.3 SHOULD FIX: Tighten IsLikelyQuicInitialPacket ✅ Agree

Add version field check (bytes 1-4). All deployed QUIC versions use specific version numbers:

  • QUIC v1: 0x00000001
  • QUIC v2: 0x6b3343cf
  • Version negotiation: 0x00000000

Even checking that buf[1:5] matches a known version set would cut false positives dramatically.
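One way to sketch the tightened check (the function mirrors IsLikelyQuicInitialPacket in spirit, but this version-set variant is a proposal, not the shipped code):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// knownQuicVersions covers the deployed version numbers listed above.
var knownQuicVersions = map[uint32]bool{
	0x00000001: true, // QUIC v1
	0x6b3343cf: true, // QUIC v2
	0x00000000: true, // version negotiation
}

// isLikelyQuicInitial requires the long-header/Initial bits of the first
// byte AND a known version in bytes 1-4, cutting the 6.25% first-byte-only
// false-hit rate dramatically.
func isLikelyQuicInitial(buf []byte) bool {
	if len(buf) < 5 {
		return false
	}
	// bit7=1 (long header), bit6=1 (fixed bit), bits5:4=00 (Initial)
	// means the first byte falls in 0xC0-0xCF.
	if buf[0]&0xF0 != 0xC0 {
		return false
	}
	return knownQuicVersions[binary.BigEndian.Uint32(buf[1:5])]
}

func main() {
	quicV1 := []byte{0xC3, 0x00, 0x00, 0x00, 0x01}
	dnsLike := []byte{0xC5, 0x12, 0x34, 0x56, 0x78} // first byte matches, version doesn't
	fmt.Println(isLikelyQuicInitial(quicV1), isLikelyQuicInitial(dnsLike)) // true false
}
```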

4.4 SHOULD EVALUATE: agingTime=100ms

100ms is aggressive. Consider:

  • 1s default with configurable override
  • Or adaptive: start at 100ms for known-QUIC (version-verified), use longer timeout for unverified matches

4.5 OPTIONAL: Convoy goroutine exit safety net

Add a maximum lifetime or iteration count to convoy's cleanup loop as defense-in-depth. If tryDeleteQueue fails N times consecutively with no tasks received, force-exit. This prevents permanent leaks from any future race conditions.
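A minimal sketch of that safety net; the counter name, threshold, and `cleanupLoop` abstraction are hypothetical, and the real convoy loop would also keep receiving tasks between attempts:

```go
package main

import "fmt"

// maxConsecutiveDeleteFailures bounds how long a convoy may spin when its
// map entry has been stolen; a defense-in-depth backstop, not the primary fix.
const maxConsecutiveDeleteFailures = 10

// cleanupLoop models convoy's aging branch: tryDelete reports whether the
// queue was removed from the map. After N consecutive failures with no new
// tasks, the goroutine force-exits instead of looping forever.
func cleanupLoop(tryDelete func() bool) (forcedExit bool) {
	failures := 0
	for {
		if tryDelete() {
			return false // clean exit: we removed our own map entry
		}
		failures++
		if failures >= maxConsecutiveDeleteFailures {
			return true // orphaned: give up rather than leak
		}
	}
}

func main() {
	// Simulate an orphaned convoy whose key is no longer in the map.
	alwaysFail := func() bool { return false }
	fmt.Println(cleanupLoop(alwaysFail)) // true
}
```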


5. Validation Criteria — Agree with additions

Original criteria are correct. Add:

  • No convoy goroutine should exist without a corresponding entry in queues sync.Map (can verify via pprof goroutine dump + metrics endpoint udp_task_pool_count)
  • After fix deployment, convoy goroutine count should track queue count closely (within single-digit delta at any point)
  • CPU should drop to baseline within minutes, not gradually (the fix eliminates the leak, existing orphans exit on restart)

6. Summary Assessment

| Aspect | Rating | Notes |
|---|---|---|
| Problem identification | Excellent | Correct root cause, correct commit attribution |
| Evidence quality | Excellent | pprof CPU + goroutine + metrics all consistent |
| Race analysis | Good | Primary race correct; missed secondary race in acquireQueue |
| Fix proposal | Good | CompareAndDelete is the right fix; missing acquireQueue fix |
| Severity assessment | Accurate | Real regression, accumulates over time, will worsen |
Bottom line: The analysis is sound and actionable. Apply both CompareAndDelete fixes (tryDeleteQueue + acquireQueue draining path), tighten QUIC detection, and the regression will be resolved.

@olicesx
Contributor Author
olicesx commented Feb 27, 2026

@olicesx
供参考

udp-taskpool-cpu-regression summary

已确认

  • 705/1077 goroutine 是 convoy,CPU 占比 34.4%(cumulative),与文档一致
  • tryDeleteQueue 的 LoadAndDelete + 指针比较确实不是原子 CAS — 会删掉新队列并导致 goroutine 永久泄漏
  • commit 归因正确:a795323 引入缺陷,41d20f2 触发放大
  • selectgo 80ms flat 是最大的单项 flat 开销 — 数百个 convoy 每 100ms 被调度一次导致的调度器抖动

新发现(原文未覆盖)

  • acquireQueue 第 246 行的 p.queues.Delete(key) 存在同类竞态 — 两个 goroutine 同时 acquireQueue 且都看到 draining=true 时,后者的 Delete 可能误删新队列。此处也需要改为
    CompareAndDelete
  • 泄漏的 convoy goroutine 永远无法退出(key 不在 map 中 → tryDeleteQueue 始终返回 false → 无限循环),705 个是重启以来 ~17 小时的累计泄漏量

修复建议

  1. 必做:tryDeleteQueue → sync.Map.CompareAndDelete(Go 1.20+,项目用 1.26)
  2. 必做:acquireQueue draining 路径的 Delete 也改为 CompareAndDelete
  3. 建议:收紧 QUIC 检测(增加 version 字段校验),降低 6.25% 的误命中率
  4. 建议:评估 100ms agingTime 是否过于激进

2026-02-27-udp-taskpool-cpu-regression-analysis

# dae CPU 回归排查记录(2026-02-27)

1. 背景与现象

  • 观察到在 2026-02-26 16:52:43 部署 unstable-202602 9E88 26.pr-26.r44.0da556 后,CPU usage 从常态 1%~3% 上升到约 6%~10%
  • 业务负载体感无明显增加,属于版本后回归现象。
  • 当前 pprof 访问端口为 :5556endpoint_listen_address),并非 pprof_port 直绑 :6060

2. 关键证据(本次排查)

2.1 在线指标

  • go_goroutines1035~1084
  • process_cpu_seconds_total 30 秒窗口估算 CPU 约 9.47%
  • 15 分钟日志统计(示例窗口):
    • dns_udp4_info=936
    • dns_tcp4_info=97
    • udp_tuple_miss=5
    • 其他高噪 debug/warn(如 rewrite/conn_check/udp_readloop)在当前窗口接近 0。

2.2 CPU pprof(2026-02-27 09:42:34)

  • github.com/daeuniverse/dae/control.(*UdpTaskQueue).convoy 累计约 31.71%
  • runtime 调度热点显著:
    • runtime.schedule/findRunnable/selectgo 占比较高。
  • convoy 行级热点集中在:
    • control/udp_task_pool.go:151select 循环)
    • control/udp_task_pool.go:167time.Sleep(10ms)
    • control/udp_task_pool.go:177tryDeleteQueue

2.3 goroutine pprof

  • (*UdpTaskQueue).convoy goroutine 约 670,为总 goroutine 主体。
  • goroutine 快照中大量栈停在 control/udp_task_pool.go:151,与 CPU 调度热点一致。

3. 根因分析

3.1 直接缺陷(根因)

  • control/udp_task_pool.gotryDeleteQueue 当前实现:
    • LoadAndDelete(key),后比较 expected 指针。
    • 代码由 a795323 引入(refactor(control): optimize memory alignment and improve task queue management,2026-02-17)。
  • 问题点:
    • 该逻辑不是原子“按期望值删除”。
    • 并发下可能删掉“新队列映射”,随后比较失败返回 false,但删除已发生。
    • 结果是队列映射状态异常,旧/新 convoy 都可能进入长期空转循环,导致 goroutine 与调度开销持续偏高。

3.2 放大因素(触发本次回归)

  • 41d20f2refactor(control): optimize UDP handling for QUIC packets and enhance task pool management,2026-02-25)引入两项变化,显著放大上述缺陷影响:
    • UdpTaskPool 触发条件改为 IsLikelyQuicInitialPacket 命中时使用;
    • agingTimeDefaultNatTimeout(历史为 30s)改为 100ms
  • IsLikelyQuicInitialPacket 仅基于首字节位判断(component/sniffing/quic.go),理论命中空间约 16/256 = 6.25%,存在较高“伪命中”概率,导致更多队列创建/回收尝试。
  • 100ms 老化周期使 convoy 更频繁进入清理分支,进一步放大删除竞态造成的调度成本。

4. commit 溯源结论

  • 引入缺陷的 commita795323(2026-02-17)
    • 证据:git blame control/udp_task_pool.go:263-270 指向 a795323,首次引入 LoadAndDelete + pointer compare 方案。
  • 触发/放大回归的 commit41d20f2(2026-02-25)
    • 证据:git blame control/udp_task_pool.go:20-25 指向 41d20f2,引入 UdpTaskPoolAgingTime = 100ms
    • 同提交将 control_plane 改为部分 UDP 路径重新走 UdpTaskPool

5. 建议修复方案

5.1 必选修复(先做)

  • tryDeleteQueue 改为真正原子语义:
    • 使用 sync.Map.CompareAndDelete(key, expected)(Go 版本满足);
    • 禁止“先删后比对”。
  • 目标:消除错误删除新映射导致的队列异常与 convoy 常驻空转。

5.2 建议修复(次步)

  • 收紧 IsLikelyQuicInitialPacket 快速判定,降低伪命中:
    • 增加更强头部约束(不仅首字节位)。
  • 评估 agingTime=100ms 是否过激进:
    • 可回调到更保守值(如秒级)或采用自适应策略。

5.3 可观测性补充

  • 增加 udp_task_pool 相关 metrics:
    • 当前队列数、创建数、删除成功数、删除失败数、convoy 活跃数。
  • 便于回归版本对比与告警。

6. 修复后验收标准

  • go_goroutines(*UdpTaskQueue).convoy 数量明显下降,不再长期维持数百级空转。
  • CPU 回落接近历史基线(业务负载相近时接近 1%~3% 区间)。
  • pprof 中 UdpTaskQueue.convoy 与调度热点占比显著下降。

Opus-reviewed

# Opus Review: UDP TaskPool CPU Regression Analysis

Reviewed: 2026-02-27
Original doc: .plan/2026-02-27-udp-taskpool-cpu-regression-analysis.md
Status: Analysis largely correct; additional race conditions and nuances identified below.


1. Verification of Live Evidence

All key claims independently confirmed via live pprof (captured 2026-02-27 ~10:00 CST):

Metric Documented Live Measurement Match
Total goroutines ~1035-1084 1077
convoy goroutines ~670 705 (607 in select state)
convoy CPU (cum) ~31.71% 34.41% (320ms/930ms)
Line 151 select hotspot dominant 160ms (17.2% of total)
Line 167 time.Sleep(10ms) visible 40ms (4.3%)
Line 182 safeTimerReset visible 70ms (7.5%)
selectgo runtime overhead high 80ms flat, 180ms cum (19.4%)

Key observations:

  • 705 convoy goroutines are in select state without explicit wait duration — meaning they are being scheduled frequently (every ~100ms timer tick), not sleeping.
  • runtime.selectgo at 80ms flat is the single largest flat-time consumer — direct evidence of scheduler thrashing from hundreds of convoy goroutines waking on timer.
  • runtime.schedule/findRunnable at 370-380ms cumulative confirms scheduler overhead from excessive runnable goroutines.

2. Root Cause Validation

2.1 tryDeleteQueue Race — CONFIRMED, with additional detail

The documented race is correct. Here is the precise sequence:

Timeline for key K:

1. Old convoy (Q1): timer fires → refs==0, ch empty
   → sets draining=true → sleeps 10ms → final check passes
   → calls tryDeleteQueue(K, Q1)

2. Meanwhile, acquireQueue(K) runs:
   → Load(K) returns Q1 → sees draining==true → goto createNew
   → LoadOrStore(K, Q2) → succeeds (stores Q2, returns loaded=false)
   → starts new convoy goroutine for Q2

3. tryDeleteQueue(K, Q1) executes:
   → LoadAndDelete(K) → removes Q2 from map, returns Q2
   → Q2 != Q1 → returns false

Result:
- Q2 is removed from the map, but Q2's convoy goroutine is running
- Q2's convoy can NEVER exit: tryDeleteQueue always returns false
  (key no longer in map → loaded=false → returns false)
- Q2 loops forever: timer fires (100ms) → draining → sleep(10ms) →
  tryDeleteQueue fails → reset timer → repeat

This is a goroutine leak with O(n) accumulation — each false-positive QUIC match for a unique AddrPort can produce an orphaned convoy.

2.2 Additional Race in acquireQueue Draining Path — NOT documented

The original analysis missed a second race at udp_task_pool.go:246:

if q.draining.Load() {
    p.queues.Delete(key)    // ← non-atomic delete of whatever is currently at key
    goto createNew
}

Scenario: Two goroutines (A, B) both call acquireQueue(K) and both see draining=true:

1. A: LoadOrStore(K) → loaded=true, returns Q1 (draining)
2. B: LoadOrStore(K) → loaded=true, returns Q1 (draining)
3. A: Delete(K)     — removes Q1
4. C: LoadOrStore(K, Q3) → stored! Q3 is now in map, C starts convoy for Q3
5. B: Delete(K)     — removes Q3 (intended to remove Q1!)
6. B: goto createNew → LoadOrStore(K, Q4) → stored, starts convoy for Q4

Result: Q3's convoy is orphaned (same mechanism as 2.1)

This is a lower-probability race (requires two concurrent acquireQueue calls for same key during draining window), but it compounds the primary issue.

Fix: Use sync.Map.CompareAndDelete(key, q) here too — only delete if the value is still the draining queue we observed.

2.3 IsLikelyQuicInitialPacket False Positive Rate — Analysis CORRECT but understated

The documented rate of 6.25% (16/256) is mathematically correct:

  • Required: bit7=1 (long header), bit6=1 (fixed bit), bits5:4=00 (Initial type)
  • Matching first bytes: 0xC0-0xCF

What the analysis understates: this affects ALL UDP traffic, not just QUIC. DNS responses, game packets, VoIP, etc. all flow through this check. With dns_udp4_info=936 in 15 minutes (~1/sec), even DNS alone produces ~4 false positives per minute. Each creates a queue+convoy that should age out in 100ms — but due to the tryDeleteQueue race, a fraction become permanent orphans.

The 705 orphaned convoys represent the cumulative leak since the last restart (~17 hours based on goroutine 19 age of 1028 minutes).

2.4 agingTime=100ms Impact — Analysis CORRECT

The 100ms aging time amplifies the race in two ways:

  1. More frequent timer firings → more opportunities to hit the tryDeleteQueue race
  2. The 10ms sleep in the draining path is 10% of the aging time, creating a larger window for concurrent acquireQueue calls

3. Commit Attribution — CONFIRMED

Commit Role Evidence
a795323 (2026-02-17) Introduced defect: LoadAndDelete + pointer compare pattern git blame on tryDeleteQueue
41d20f2 (2026-02-25) Triggered regression: 100ms aging + broader QUIC path usage git blame on UdpTaskPoolAgingTime

4. Fix Recommendations — Review & Amendments

4.1 MUST FIX: tryDeleteQueueCompareAndDelete ✅ Agree

func (p *UdpTaskPool) tryDeleteQueue(key netip.AddrPort, expected *UdpTaskQueue) bool {
    return p.queues.CompareAndDelete(key, expected)
}

sync.Map.CompareAndDelete is available since Go 1.20; project uses Go 1.26. Pointer comparison via any interface equality works correctly.

4.2 MUST FIX (MISSING): acquireQueue draining path

The p.queues.Delete(key) at line 246 should also use CompareAndDelete:

if q.draining.Load() {
    p.queues.CompareAndDelete(key, q)  // only delete if still the draining queue
    goto createNew
}

4.3 SHOULD FIX: Tighten IsLikelyQuicInitialPacket ✅ Agree

Add version field check (bytes 1-4). All deployed QUIC versions use specific version numbers:

  • QUIC v1: 0x00000001
  • QUIC v2: 0x6b3343cf
  • Version negotiation: 0x00000000

Even checking that buf[1:5] matches a known version set would cut false positives dramatically.

4.4 SHOULD EVALUATE: agingTime=100ms

100ms is aggressive. Consider:

  • 1s default with configurable override
  • Or adaptive: start at 100ms for known-QUIC (version-verified), use longer timeout for unverified matches

4.5 OPTIONAL: Convoy goroutine exit safety net

Add a maximum lifetime or iteration count to convoy's cleanup loop as defense-in-depth. If tryDeleteQueue fails N times consecutively with no tasks received, force-exit. This prevents permanent leaks from any future race conditions.


5. Validation Criteria — Agree with additions

Original criteria are correct. Add:

  • No convoy goroutine should exist without a corresponding entry in queues sync.Map (can verify via pprof goroutine dump + metrics endpoint udp_task_pool_count)
  • After fix deployment, convoy goroutine count should track queue count closely (within single-digit delta at any point)
  • CPU should drop to baseline within minutes, not gradually (the fix eliminates the leak, existing orphans exit on restart)

6. Summary Assessment

| Aspect | Rating | Notes |
| --- | --- | --- |
| Problem identification | Excellent | Correct root cause, correct commit attribution |
| Evidence quality | Excellent | pprof CPU + goroutine + metrics all consistent |
| Race analysis | Good | Primary race correct; missed secondary race in acquireQueue |
| Fix proposal | Good | CompareAndDelete is the right fix; missing acquireQueue fix |
| Severity assessment | Accurate | Real regression, accumulates over time, will worsen |

Bottom line: The analysis is sound and actionable. Apply both CompareAndDelete fixes (tryDeleteQueue + acquireQueue draining path), tighten QUIC detection, and the regression will be resolved.

Thanks for the review.

@MaurUppi
MaurUppi commented Feb 27, 2026

@olicesx
For reference.

"Thanks for the review"? It's your ongoing contributions that deserve the thanks.

Test/build details: MaurUppi#26

EDITED

PR#26 before/after report <-- appended: same-methodology 24h retest (two windows + pprof)

udp-taskpool CPU regression: before/after monitoring of the fix (2026-02-27)

1. Monitoring Goal and Subject

  • Subject: after deploying the revised PR#26 (dae running since 2026-02-27 11:37:37 CST)
  • Goal: compare CPU / goroutine / UdpTaskQueue.convoy behavior before and after the fix, to verify the regression is eliminated
  • Reference baseline: .plan/2026-02-27-udp-taskpool-cpu-regression-analysis.md

2. Pre-fix Baseline (Before)

Timestamp: 2026-02-27 09:42:34 CST (while triaging the old version)

  • go_goroutines: about 1035-1084
  • 30s-window CPU estimate: about 9.47%
  • goroutine pprof: (*UdpTaskQueue).convoy about 670
  • CPU pprof: (*UdpTaskQueue).convoy cumulative about 31.71%

3. Post-fix Monitoring (After)

Note: executed as the planned two monitoring windows.
During the run the actual metrics path turned out to be /metrics (/debug/metrics returned 404); the second window was re-sampled against /metrics and corrected.

3.1 Window 1 (early window after startup)

Monitoring period: 2026-02-27 12:02:52 CST to 2026-02-27 12:14:56 CST

| index | timestamp | go_goroutines | go_threads | process_cpu_seconds_total | rss_bytes | convoy_goroutines |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 2026-02-27 12:02:52 CST | 323 | 13 | 33.07 | 8.7605248e+07 | 2 |
| 1 | 2026-02-27 12:04:53 CST | 357 | 13 | 35.75 | 9.0226688e+07 | 2 |
| 2 | 2026-02-27 12:06:53 CST | 392 | 13 | 38.85 | 8.9423872e+07 | 2 |
| 3 | 2026-02-27 12:08:54 CST | 710 | 13 | 43.11 | 9.4928896e+07 | 2 |
| 4 | 2026-02-27 12:10:55 CST | 573 | 13 | 48.65 | 9.5891456e+07 | 7 |
| 5 | 2026-02-27 12:12:55 CST | 540 | 13 | 58.31 | 9.9557376e+07 | 9 |
| 6 | 2026-02-27 12:14:56 CST | 567 | 13 | 66.99 | 1.04476672e+08 | 14 |

Summary:

  • go_goroutines mean 494.57, peak 710
  • convoy_goroutines mean 5.43, peak 14
  • CPU estimate (Δprocess_cpu_seconds_total / Δwall * 100): 4.69%

pprof (end of window 1, 2026-02-27 12:15:10 CST):

  • CPU profile: 780ms total samples (of 30s, about 2.60%)
    • (*UdpTaskQueue).convoy cumulative 50ms (6.41%)
  • Goroutine profile: 546 total, 14 of them convoy
  • Heap inuse: about 30.4MB

3.2 Window 2 (1h+ after startup)

Monitoring period: 2026-02-27 12:50:17 CST to 2026-02-27 13:02:19 CST

| index | timestamp | go_goroutines | go_threads | process_cpu_seconds_total | rss_bytes | convoy_goroutines |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 2026-02-27 12:50:17 CST | 461 | 13 | 125.56 | 1.01920768e+08 | 18 |
| 1 | 2026-02-27 12:52:17 CST | 389 | 13 | 128.57 | 9.6444416e+07 | 19 |
| 2 | 2026-02-27 12:54:18 CST | 412 | 13 | 132.37 | 9.7374208e+07 | 19 |
| 3 | 2026-02-27 12:56:18 CST | 382 | 13 | 135.21 | 9.80992e+07 | 19 |
| 4 | 2026-02-27 12:58:18 CST | 381 | 13 | 137.94 | 9.2954624e+07 | 21 |
| 5 | 2026-02-27 13:00:19 CST | 372 | 13 | 141.89 | 9.0628096e+07 | 22 |
| 6 | 2026-02-27 13:02:19 CST | 397 | 13 | 145.14 | 9.1152384e+07 | 22 |

Summary:

  • go_goroutines mean 399.14, peak 461
  • convoy_goroutines mean 20.00, peak 22
  • CPU estimate (Δprocess_cpu_seconds_total / Δwall * 100): 2.71%

pprof (end of window 2, 2026-02-27 13:02:28 to 13:02:58 CST):

  • CPU profile: 800ms total samples (of 30s, about 2.67%)
    • (*UdpTaskQueue).convoy cumulative 50ms (6.25%)
  • Goroutine profile: 416 total, 22 of them convoy
  • Heap inuse: about 29.4MB

4. Before/After Conclusions

4.1 The regression fix is effective (core metrics)

  • CPU: from about 9.47% (Before) down to 4.69% (After, window 1), then further to 2.71% (After, window 2)
  • Total goroutines: from about 1035-1084 (Before) down to a mean range of roughly 399-495
  • convoy goroutines: from about 670 (Before) down to 14 (end of window 1) / 22 (end of window 2)
  • convoy cumulative CPU share: from 31.71% (Before) down to about 6.3%

4.2 Current-state Assessment

  • The old failure shape ("hundreds of resident convoys + scheduler churn") was not observed again.
  • In window 2, convoy grew from 18 to 22; this is low-level fluctuation, far below the pre-fix magnitude, and the current evidence is insufficient to declare a new leak.

5. Follow-up Suggestions

  1. After 24h of continuous running, repeat a same-methodology retest (same two windows + pprof) to confirm convoy stays within low double digits.
  2. Add dedicated udp_task_pool metrics (queue count, create/delete successes, delete failures, active convoys) to shrink the observation blind spot of relying on pprof grep.
  3. If CPU fluctuation persists, run an A/B under the same load: temporarily lower the metrics scrape frequency to rule out user-space overhead noise from the collection itself.

6. 2026-02-28 Same-methodology Retest (two windows + pprof)

Retest goal: validate the key judgment from section 5, i.e. whether convoy can stay stable in the low double digits.

Methodology (same as section 3):

  • 7 sample points per window (about 12 minutes, 2-minute interval)
  • at the end of each window, capture a 30-second CPU pprof, a goroutine pprof, and a heap pprof
  • endpoints: /metrics and /debug/pprof/* (http://127.0.0.1:5556, collected via ssh dae)
  • gap between the two windows: 30 minutes

Raw sample files: .plan/udp-taskpool-cpu-regression/retest-20260228-105118/

6.1 Window 1

Monitoring period: 2026-02-28 10:51:18 CST to 2026-02-28 11:03:22 CST

| index | timestamp | go_goroutines | go_threads | process_cpu_seconds_total | rss_bytes | convoy_goroutines |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 2026-02-28 10:51:18 CST | 967 | 14 | 3416.64 | 1.355776e+08 | 360 |
| 1 | 2026-02-28 10:53:18 CST | 977 | 14 | 3425.39 | 1.37412608e+08 | 546 |
| 2 | 2026-02-28 10:55:19 CST | 985 | 14 | 3434.36 | 1.38723328e+08 | 592 |
| 3 | 2026-02-28 10:57:20 CST | 969 | 14 | 3443.3 | 1.38190848e+08 | 593 |
| 4 | 2026-02-28 10:59:20 CST | 930 | 14 | 3451.83 | 1.4422016e+08 | 549 |
| 5 | 2026-02-28 11:01:21 CST | 952 | 14 | 3460.12 | 1.4422016e+08 | 595 |
| 6 | 2026-02-28 11:03:22 CST | 989 | 14 | 3469.04 | 1.44351232e+08 | 319 |

Summary:

  • go_goroutines mean 967.00, peak 989
  • convoy_goroutines mean 507.71, peak 595
  • CPU estimate (Δprocess_cpu_seconds_total / Δwall * 100): 7.24%

pprof (end of window 1, 2026-02-28 11:03:22 to 11:03:53 CST):

  • CPU profile: 2.08s total samples (of 30s, about 6.93%)
    • (*UdpTaskQueue).convoy cumulative 0.85s (40.87%)
  • Goroutine profile: 990 total, 588 of them convoy
  • Heap inuse: about 50.70MB

6.2 Window 2

Monitoring period: 2026-02-28 11:33:55 CST to 2026-02-28 11:46:01 CST

| index | timestamp | go_goroutines | go_threads | process_cpu_seconds_total | rss_bytes | convoy_goroutines |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 2026-02-28 11:33:55 CST | 1089 | 14 | 3611.72 | 1.44420864e+08 | 485 |
| 1 | 2026-02-28 11:35:56 CST | 1088 | 14 | 3620.69 | 1.46649088e+08 | 606 |
| 2 | 2026-02-28 11:37:57 CST | 1022 | 14 | 3629.28 | 1.46649088e+08 | 606 |
| 3 | 2026-02-28 11:39:58 CST | 956 | 14 | 3638.99 | 1.46649088e+08 | 356 |
| 4 | 2026-02-28 11:41:59 CST | 1042 | 14 | 3648.21 | 1.46509824e+08 | 590 |
| 5 | 2026-02-28 11:44:00 CST | 1051 | 14 | 3657.28 | 1.46509824e+08 | 583 |
| 6 | 2026-02-28 11:46:01 CST | 1072 | 14 | 3666.1 | 1.47165184e+08 | 610 |

Summary:

  • go_goroutines mean 1045.71, peak 1089
  • convoy_goroutines mean 548.00, peak 610
  • CPU estimate (Δprocess_cpu_seconds_total / Δwall * 100): 7.49%

pprof (end of window 2, 2026-02-28 11:46:01 to 11:46:33 CST):

  • CPU profile: 1.88s total samples (of 30s, about 6.27%)
    • (*UdpTaskQueue).convoy cumulative 0.71s (37.77%)
  • Goroutine profile: 1068 total, 311 of them convoy
  • Heap inuse: about 49.27MB (49265.42kB)

6.3 Retest Verdict on "Low-double-digit Stability"

  • Verdict: not passed. Across both windows, convoy_goroutines stayed at the hundreds level for long stretches (means 507.71 / 548.00) and did not stabilize in the low double digits.
  • Compared with section 3.2 (window 2) on 2026-02-27:
    • convoy_goroutines mean: 20.00 -> 548.00
    • go_goroutines mean: 399.14 -> 1045.71
    • CPU estimate: 2.71% -> 7.49%
    • convoy cumulative CPU: 6.25% -> 37.77%

7. New Findings and Next Steps (Systematic Debugging)

Based on this retest's evidence, the current state is closer to the failure shape of "convoy amplification/persistence again" than to "stable low-level fluctuation".

Recommended next step: continue Phase 1 of the same process (evidence collection only, no code changes yet):

  1. At a fixed point in time, record the running instance's version and start time (confirm it matches the expected commit).
  2. At the same sample points, also scrape udp_task_pool_count and the newly added compare-and-delete metrics (if exposed), and check whether "queue count vs convoy count" diverge.
  3. Compare this round's traffic pattern with 2026-02-27 (DNS QPS, connection-establishment rate, rule-hit distribution) to rule out load-side amplification.

8. 2026-02-28 Additional Evidence: Verifying the Remaining "Resident convoy" Path (no code changes)

8.1 Goal

  • Verify whether, with the CAS fix already merged, the current implementation still contains an execution path that can leave convoys resident.
  • This section is Phase 1 evidence collection only; it includes no code fix.

8.2 Key Live Evidence (production)

Sample time: 2026-02-28 12:26 to 12:32 CST (/metrics + goroutine pprof pulled via ssh dae)

Single-point sample:

  • dae_udp_task_queues_active = 0
  • go_goroutines ≈ 1052
  • goroutine pprof:
    • convoy_line151 (control/udp_task_pool.go:151): about 599
    • convoy_line167 (control/udp_task_pool.go:167): about 25

Short time series (8 points, excerpt):

| timestamp | queue_active | go_goroutines | convoy_line151 | convoy_line167 |
| --- | --- | --- | --- | --- |
| 2026-02-28 12:26:39 CST | 0 | 1052 | 624 | 0 |
| 2026-02-28 12:27:09 CST | 0 | 1061 | 533 | 91 |
| 2026-02-28 12:27:39 CST | 0 | 1055 | 624 | 0 |
| 2026-02-28 12:28:10 CST | 0 | 1030 | 523 | 102 |
| 2026-02-28 12:28:40 CST | 0 | 1066 | 625 | 0 |
| 2026-02-28 12:31:28 CST | 0 | 1072 | 549 | 77 |
| 2026-02-28 12:31:43 CST | 0 | 1070 | 626 | 0 |
| 2026-02-28 12:31:58 CST | 0 | 1065 | 626 | 0 |

Observations:

  • queue_active stays at 0 while convoy (line 151) stays in the hundreds; this is not transient jitter.
  • So there exists a population of convoy goroutines that are still alive but no longer present in DefaultUdpTaskPool.queues.

8.3 Mapping to Code Paths

Relevant code points:

  • task-enqueue trigger (QUIC initial packet): control/control_plane.go:1194-1200
  • convoy main loop and cleanup path: control/udp_task_pool.go:151-183
  • acquireQueue draining branch: control/udp_task_pool.go:245-247
  • tryDeleteQueue (CAS): control/udp_task_pool.go:265-266

Remaining path that can cause residency (based on the code plus the metric divergence above):

  1. A convoy enters the timer cleanup branch, sets draining=true, then Sleep(10ms).
  2. A concurrent acquireQueue hits the same key, observes draining=true, and in its branch calls p.tryDeleteQueue(key, q), possibly deleting that queue from the map first.
  3. When the convoy wakes up, its own q.p.tryDeleteQueue(q.key, q) fails (the key is gone from the map, or the value has changed).
  4. The failure path currently only does draining=false + reset timer + continue; there is no exit condition for "the map entry is no longer me", so the convoy can keep polling indefinitely.

8.4 Section Conclusion (Phase 1)

  • We now have three categories of evidence (metric divergence + stack locations + code path) supporting a remaining resident-convoy path.
  • The evidence is sufficient to proceed to minimal validation (Phase 2/3): a single-variable experiment around "the exit condition after tryDeleteQueue fails".

9. Phase 2/3: Single-variable Minimal Validation (exit condition after tryDeleteQueue failure)

9.1 Hypothesis (single)

  • Hypothesis: when a convoy's cleanup-phase tryDeleteQueue fails and the map no longer holds the key -> q mapping, the current implementation keeps looping instead of exiting, producing a "resident convoy detached from the map".
  • Basis: section 8 already observed queue_active=0 coexisting with hundreds of convoy(line151) goroutines.

9.2 Variable Control

  • Held constant:
    • production code logic unchanged (tests only)
    • queue aging mechanism and convoy main-loop path unchanged
  • Single manipulated variable:
    • after the convoy enters draining, the "concurrent path" deletes the key -> q mapping from the map first, and only then does the convoy attempt its own delete.

9.3 Experiment Implementation

  • New diagnostic test: control/udp_task_pool_phase3_validation_test.go
  • Test name: TestPhase3_ConvoyPersistsWhenQueueMappingDeletedBeforeSelfDelete
  • Core steps:
    1. Start a convoy and wait for it to enter draining=true.
    2. Call pool.tryDeleteQueue(key, q) once up front, driving the map count to 0.
    3. Wait for the convoy to wake up and attempt self-deletion.
    4. Assert: the convoy has not exited (still alive) while the map count is already 0.
    5. For test teardown, restore the key -> q mapping so the convoy can exit normally on its next round.

9.4 Execution and Results

Execution time: 2026-02-28 12:56 to 12:57 CST

Command (Linux container):

```shell
docker run --rm -e GOTOOLCHAIN=auto -v "$PWD":/src -w /src golang:1.25 \
  go test ./control -run TestPhase3_ConvoyPersistsWhenQueueMappingDeletedBeforeSelfDelete -v
```

Result: PASS

Same test with the race detector:

```shell
docker run --rm -e GOTOOLCHAIN=auto -v "$PWD":/src -w /src golang:1.25 \
  go test -race ./control -run TestPhase3_ConvoyPersistsWhenQueueMappingDeletedBeforeSelfDelete -v
```

Result: PASS

9.5 Conclusion (Phase 2/3)

  • The single-variable experiment supports the hypothesis in 9.1: in the "mapping deleted first" scenario, the convoy can stay alive while the map count is 0.
  • This is directionally consistent with the production symptom of "queue_active=0 yet convoys still resident in large numbers".
  • Stage conclusion: evidence is confirmed; the next step is Phase 4 (after pinning down the intended exit semantics, make a minimal code fix plus regression tests).

10. Phase 4: Minimal Fix and Regression Validation

10.1 Intended Semantics (codified as a test first)

  • Expectation: when a convoy's cleanup-time tryDeleteQueue fails and the map no longer holds the key -> current-q mapping (absent, or replaced by a new queue), the old convoy must exit rather than keep polling as a resident.

Test implementation:

  • New regression test: control/udp_task_pool_convoy_exit_test.go
  • Case: TestConvoyExitsWhenQueueMappingDeletedBeforeSelfDelete
  • TDD red confirmed: before the fix, the case fails ("convoy did not exit after mapping was deleted before self-delete").

10.2 Minimal Code Fix

File: control/udp_task_pool.go (the convoy timer cleanup branch)

Changes:

  1. Keep the original successful-delete branch unchanged (on tryDeleteQueue==true, return the channel and exit).
  2. After tryDeleteQueue==false, add a mapping-ownership check:
    • if the map no longer holds the key, or the value under the key is no longer the current q, classify this convoy as stale;
    • return the channel and exit immediately;
    • no longer fall into the resident loop of draining=false + timer reset + continue.

10.3 Validation Results

Execution time: 2026-02-28 13:04 to 13:08 CST

  1. New regression test (post-fix):
    • go test ./control -run TestConvoyExitsWhenQueueMappingDeletedBeforeSelfDelete -v
    • Result: PASS
  2. Related regression set (CI methodology, extended):
    • go test ./control -run 'Test(UdpTaskPool|CompareAndDelete|NoGoroutineLeak|Convoy|HighConcurrencyStress)' -count=1
    • Result: PASS
  3. Same set with the race detector:
    • go test -race ./control -run 'Test(UdpTaskPool|CompareAndDelete|NoGoroutineLeak|Convoy|HighConcurrencyStress)' -count=1
    • Result: PASS
  4. QUIC detection regression:
    • go test ./component/sniffing -run TestIsLikelyQuicInitialPacket -count=1
    • Result: PASS

10.4 Phase 4 Conclusion

  • The "failing test -> minimal fix -> regression pass" loop is complete.
  • The fix targets exactly "the exit condition after tryDeleteQueue fails", is consistent with the Phase 1-3 evidence chain, and introduces no additional behavioral surface.

@MaurUppi
MaurUppi commented Feb 28, 2026

@olicesx

Locating the commit PR#936 needs a follow-up fix for

  • The commit needing a follow-up fix is 6c71a20 (it applied the CAS fix but left a residual path).
  • Short causal chain:
    • 6c71a20 changed deletion to CompareAndDelete, fixing the "new queue deleted by mistake" problem;
    • but inside the convoy, a failed tryDeleteQueue still goes draining=false + reset + continue;
    • when the mapping has been concurrently deleted/replaced, the old convoy detaches from the map and loops as a resident, causing convoy accumulation and CPU growth.
  • The fix point is udp_task_pool.go:182: add "exit immediately when the mapping no longer points at the current q".

The PR on my own fork: #27

metrics dashboard log

(Screenshots: metrics dashboard, CleanShot 2026-02-28 13:48 to 13:50)

@olicesx olicesx force-pushed the optimize/code-quality-fixes branch from e9685c7 to 7eafc6e on March 2, 2026 at 01:52
qi-mooo pushed a commit to qi-mooo/dae that referenced this pull request Apr 8, 2026
qi-mooo added a commit to qi-mooo/dae that referenced this pull request Apr 8, 2026
kix and others added 23 commits April 12, 2026 18:20
…bility

- Simplified the dnsForwarderKey structure by removing unnecessary dialArgument.
- Added tests for ResetDnsForwarders to ensure in-flight forwarders are handled correctly.
- Enhanced DNSListener to use atomic pointers for the ControlPlane, improving thread safety.
- Updated dnsHandler to utilize the new Controller method for better error handling.
- Introduced new methods in failedQuicDcidCache for managing shard storage and cleanup.
- Improved routing matcher builder to retain state in snapshots and refactored kernspace building logic.
- Added tests to verify the integrity of the routing kernspace snapshot.
- Enhanced UDP handling with new packet sending functions to support advanced features.
The reload preparation path in cmd/run.go uses a 45-second timeout context that was leaking into ControlPlane lifecycle contexts via context.WithCancel(ctx). When the timeout fired, Serve() would exit and all traffic (both direct and proxy) would die.

- Derive all CP-owned contexts from context.Background() instead of the caller's potentially-timed-out ctx
- Add retired atomic.Bool to block stale health-check callback writes during drain
- Add MarkRetired() to both staged and non-staged retirement goroutines
- Add Serve() exit reason logging to distinguish normal vs timeout-driven exits

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
DNS controller workers (bpfUpdateWorker, janitor, evictor) watched baseContext().Done() which changes across reload generations. When the lifecycle context swapped during reload, workers would exit prematurely.

Remove baseContext().Done() watches so workers survive across reloads. Workers are stopped via explicit stop channels closed in DnsController.Close().

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
clearRejectedReloadProgress() hardcoded SignalProgressFilePath for reads, but tests override the writer to use temp files. Add getRunSignalProgress variable so tests can override both read and write paths.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Three DNS router tests use real UDP sockets with SO_MARK which requires CAP_NET_ADMIN. Add skipIfNoSocketMark helper and skip these tests in CI containers that lack the capability.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
RestoreHealthSnapshot unconditionally set reloadInheritedHealth which added a full CheckInterval (~30s) delay before the first health check. When dialers inherited NOT-ALIVE state from the previous generation, they stayed unreachable for 30+ seconds after reload.

Only defer the first health check when ALL inherited collections are ALIVE. NOT-ALIVE dialers need an immediate probe to recover connectivity.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
…and limit response reads

- errors: IsUDPEndpointNormalClose(nil) returns false to match companion function
- netutils: comma-ok type assertion for logger from context value
- subscription: cap io.ReadAll with 10MB LimitReader
- config_merger: defer f.Close() after os.Open to prevent fd leak
- rawsock_linux: syscall.Close(sock) on bind failure to prevent fd leak

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
…t DoH response reads

- daedns: add singleflight.Group to LookupIPAddr for concurrent lookup dedup
- daedns: use sync.Pool for 65535-byte UDP DNS buffers instead of per-query allocation
- daedns: cap DoH response with io.LimitReader(resp.Body, 65535)
- control/dns: cap DoH response with io.LimitReader(resp.Body, 65535)

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
…e exit, and connectivity check limit

- routing: detect IPv6 with ':' and use /128 instead of /32
- outbound/filter: cache compiled regexp2 patterns in sync.Map
- sniffing: select on ctx.Done() in readStreamOnceAsync to prevent goroutine leak
- connectivity_check: cap debug body read with io.LimitReader(resp.Body, 4096)

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
…te audit-fix head

- cmd/run: convert if-else chain to switch for golangci-lint gocritic
- go.mod: replace local outbound with remote olicesx/outbound pseudo-version

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
- Introduced a shared timeout for DNS lookups to prevent blocking indefinitely.
- Added functions to interrupt connections on context cancellation for both TCP and QUIC.
- Enhanced the Router to utilize the new timeout and interruption mechanisms.
- Updated tests to verify the behavior of deduplicated lookups and large UDP responses.
- Modified the DNS forwarder to track consecutive errors and retire after a threshold.
- Adjusted the handling of proxy TCP forwarders to retain them on ordinary transport errors.
- Updated go.mod to use the latest outbound dependency version.

6 participants