
feat(control): improve DNS fallback reliability and harden connection lifecycle #936

Open
olicesx wants to merge 525 commits into daeuniverse:main from olicesx:optimize/code-quality-fixes
Conversation

@olicesx
Contributor
@olicesx olicesx commented Feb 15, 2026

Background

This PR improves control-plane robustness with a focus on DNS forwarding reliability and concurrency safety.
Main updates include:

Improve tcp+udp DNS upstream behavior with robust UDP-first and TCP fallback handling.
Feed DNS forward failures into dialer health feedback so failover decisions can react faster.
Harden DNS/UDP connection lifecycle under high concurrency.
Add regression tests for fallback, timeout cleanup, and pool safety.

Checklist

Full Changelogs

feat(dns): add robust DNS forward fallback path for tcp+udp upstream (UDP-first with TCP fallback on request failure).
fix(dns): report DNS forward failures to dialer health feedback path to improve failover quality.
fix(control): harden DNS/UDP connection lifecycle handling in high-concurrency paths.
test(control): add regression tests for DNS fallback, timeout cleanup, and pool concurrency safety.

Issue Reference

Closes #[issue number]

Test Result

Environment: Linux

Passed:

go test ./common/consts ./component/sniffing ./control
go test -race ./control
go build .

Copilot AI review requested due to automatic review settings February 15, 2026 07:30
@olicesx olicesx requested review from a team as code owners February 15, 2026 07:30
Copilot AI left a comment


Pull request overview

This PR hardens the control-plane DNS forwarding and connection lifecycle under concurrency, adding a UDP-first/TCP-fallback path for tcp+udp DNS upstreams, health feedback on forward failures, and multiple regression tests to validate fallback, cleanup, and pool safety.

Changes:

  • Add robust DNS forwarding behavior: UDP-first with TCP fallback for tcp+udp upstreams, plus dialer health feedback on forward failures and concurrency limiting/singleflight coalescing.
  • Improve UDP/TCP connection lifecycle and concurrency safety (UDP task pool GC, DNS connection pooling/pipelining, routing-result caching for UDP endpoints).
  • Add/expand regression tests and update changelog entries.

Reviewed changes

Copilot reviewed 32 out of 32 changed files in this pull request and generated 2 comments.

Show a summary per file
| File | Description |
|---|---|
| control/utils.go | Adjust routing matcher call signature and MAC formatting to fixed-size 16-byte inputs. |
| control/udp_task_pool_test.go | Replace prior timing-based test with deterministic concurrency/order regression tests. |
| control/udp_task_pool.go | Rework UDP task queue lifecycle/GC for concurrency safety and predictable idle cleanup. |
| control/udp_routing_cache_test.go | Add tests for UDP endpoint routing-result cache hit/expire behavior. |
| control/udp_endpoint_pool.go | Add per-endpoint routing-result cache with TTL and concurrency protection. |
| control/udp.go | Add fast QUIC prefilter to avoid expensive sniffing; handle DNS concurrency-limit refusal explicitly. |
| control/tcp_test.go | Add regression test ensuring RelayTCP cancellation unblocks the opposite direction. |
| control/tcp.go | Use context-driven cancellation to interrupt the other copy direction promptly; switch to consts IPPROTO. |
| control/routing_matcher_userspace.go | Change matcher inputs to fixed-size [16]uint8 and reduce per-call allocations. |
| control/packet_sniffer_pool_test.go | Make packet sniffer tests isolate global pool state and remove sessions explicitly. |
| control/dns_pipelining_bench_test.go | Add benchmarks around pipelined DNS conn, ID allocation, and contention patterns. |
| control/dns_pipelined_conn_test.go | Add tests ensuring pipelined conn cleanup on success/timeout and input ID preservation. |
| control/dns_listener.go | Fix TCP listener start logic and ignore concurrency-limit errors (REFUSED already written). |
| control/dns_id_bitmap_test.go | Add tests for concurrent ID allocation uniqueness and reuse behavior. |
| control/dns_fallback_test.go | Add tests for UDP→TCP fallback and for avoiding dialer poisoning on canceled contexts. |
| control/dns_control.go | Major DNS controller hardening: concurrency limiter, singleflight coalescing, sync.Map caches, forwarder caching, fallback routing, and close lifecycle. |
| control/dns_conn_pool_test.go | Add tests for UDP conn pool close/put races, conn pool dial contention, and responseSlot reuse. |
| control/dns_concurrency_test.go | Add regression test validating concurrency limiter rejection behavior. |
| control/dns_cache.go | Replace reflection-based deep copy with explicit RR copying and add DnsCache.Clone. |
| control/dns.go | Implement pooled/pipelined DNS transports, TCP/TLS connection pooling, UDP conn pool, and improved cleanup paths. |
| control/control_plane_core_test.go | Add race regression test validating atomic core flip behavior. |
| control/control_plane_core.go | Make core flip atomic/CAS-based and apply best-effort qdisc/filter cleanup comments. |
| control/control_plane.go | Wire DNS controller Close into lifecycle, switch to sync.Map cloning, and add UDP routing-result caching in receive path. |
| control/connectivity.go | Use local consts IP protocol numbers instead of unix constants. |
| control/bpf_utils.go | Add documentation notes for auto-generated BPF types; improve batch map error wrapping. |
| control/anyfrom_pool.go | Gate GSO detection behind env var to keep GSO disabled by default but optionally testable. |
| component/sniffing/sniffer.go | Add cheap QUIC header precheck to avoid costly sniffing when not applicable. |
| component/sniffing/quic_test.go | Add unit tests for the QUIC initial-packet precheck. |
| component/sniffing/quic.go | Add IsLikelyQuicInitialPacket helper for fast-path filtering. |
| common/consts/dialer_test.go | Add tests for protocol conversion helpers. |
| common/consts/dialer.go | Replace unix protocol constants with local IANA protocol numbers; add doc comments. |
| CHANGELOGS.md | Add Unreleased section documenting DNS/control robustness improvements and tests. |


Comment thread control/dns.go Outdated
if slot == nil {
continue
}
slot.set(&msg)
Copilot AI Feb 15, 2026


In pipelinedConn.readLoop, slot.set(&msg) passes a pointer to the loop-local msg variable. Because multiple in-flight requests can be pending concurrently, this can lead to responses sharing/overwriting the same underlying dnsmessage.Msg storage (and potential data races / wrong answers delivered). Allocate a distinct message per response before sending it to the slot (e.g., create a new dnsmessage.Msg and copy msg into it, or copy into a new variable whose address won’t be reused across iterations).

Suggested change
slot.set(&msg)
// Allocate a distinct message instance per response to avoid
// sharing the loop-local msg between concurrent requests.
resp := new(dnsmessage.Msg)
*resp = msg
slot.set(resp)

Comment thread control/dns_cache.go
Comment on lines 22 to +28
 func (c *DnsCache) FillInto(req *dnsmessage.Msg) {
-	req.Answer = deepcopy.Copy(c.Answer).([]dnsmessage.RR)
+	if c.Answer != nil {
+		req.Answer = make([]dnsmessage.RR, len(c.Answer))
+		for i, rr := range c.Answer {
+			req.Answer[i] = dnsmessage.Copy(rr)
+		}
+	}
Copilot AI Feb 15, 2026


DnsCache.FillInto only assigns req.Answer when c.Answer != nil. If the cache entry has a nil/empty Answer, this leaves any existing req.Answer intact, which can accidentally leak stale answers if the dnsmessage.Msg is reused. Always set req.Answer explicitly (set to nil/empty when c.Answer is nil) before setting the other response fields.
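The always-assign pattern the review asks for can be sketched as follows. The toy `RR`/`Msg` types here stand in for the real `dnsmessage` types so the pattern is runnable in isolation; this is illustrative, not the PR's code.

```go
package main

import "fmt"

// Toy stand-ins for dnsmessage.Msg and dnsmessage.RR.
type RR string
type Msg struct{ Answer []RR }
type DnsCache struct{ Answer []RR }

// FillInto assigns req.Answer unconditionally: when the cache entry has
// no Answer section, any stale answers in a reused Msg are cleared
// rather than silently left in place.
func (c *DnsCache) FillInto(req *Msg) {
	if c.Answer == nil {
		req.Answer = nil // clear stale answers explicitly
		return
	}
	req.Answer = make([]RR, len(c.Answer))
	copy(req.Answer, c.Answer)
}

func main() {
	msg := &Msg{Answer: []RR{"stale A record"}}
	(&DnsCache{}).FillInto(msg) // cache entry with nil Answer
	fmt.Println(msg.Answer == nil) // true: stale answer was cleared
}
```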

@olicesx olicesx requested a review from a team as a code owner February 16, 2026 15:20
MarksonHon
MarksonHon previously approved these changes Feb 17, 2026
@MaurUppi

Tested 5000 QPS with dnsperf: 10 clients, 2 worker threads, stats output interval 1s.

dnsperf -s 192.168.1.15 \
        -d targets/queries_cache_heavy.txt \
        -c 10 \
        -T 2 \
        -l 60 \
        -S 1 \
        -Q 5000

I also counted the "broken pipe" log entries since startup; compared to v1.0.0 they dropped sharply, from several thousand to single digits.
journalctl -u dae --since "2026-02-21 19:19:47" --no-pager | grep -c "broken pipe"

CleanShot 2026-02-22 at 11 45 41

Everything feels normal in use.

@olicesx olicesx force-pushed the optimize/code-quality-fixes branch 2 times, most recently from 6ecfee3 to 348514f Compare February 22, 2026 20:57
@MaurUppi

If the following doesn't belong here, please tell me and I'll delete it.

feat(dns): optimize DNS caching with async write and lock-free upstream resolver 594f449

This improvement is impressive!

| Dimension | Benchmark | Metric | 7c2f407 -> 594f449 | origin/main (5268be5) -> 594f449 | Conclusion |
|---|---|---|---|---|---|
| Upstream hot path | BenchmarkUpstreamResolver_GetUpstream_Serial | sec/op | 5.400ns -> 2.218ns (-58.9%) | 5.444ns -> 2.219ns (-59.2%) | Clearly faster |
| Upstream hot path | BenchmarkUpstreamResolver_GetUpstream_Parallel | sec/op | 43.590ns -> 3.355ns (-92.3%) | 48.750ns -> 3.878ns (-92.0%) | Large gain on the concurrent path |
| Upstream summary | geomean of the two above | sec/op | -82.22% | -81.99% | Consistent trend, stable gain |
| DNS control-plane common path | PipelinedConn_Sequential | sec/op | 2.964us -> 2.975us (+0.4%) | 3.056us -> 3.008us (-1.6%) | Essentially flat |
| DNS control-plane common path | PipelinedConn_Concurrent | sec/op | 3.326us -> 3.254us (-2.2%) | 3.450us -> 3.446us (-0.1%) | Essentially flat |
| DNS control-plane common path | DnsController_Singleflight | sec/op | 0.04573n -> 0.04183n | 0.03788n -> 0.04818n | Noisy microbenchmark, low reference value |
| Control-plane summary | geomean of the 3 above | sec/op | -3.51% | +7.74% | Noise/context-sensitive; not a standalone conclusion |

| Dimension | Benchmark | Result (head-only) | Notes |
|---|---|---|---|
| Async cache gain | BenchmarkAsyncCacheVsSyncCache | Async ~723-923ns/op vs Sync ~1.53-1.64ms/op | roughly 1700x~2200x (same machine, same run) |
| Cache-stampede protection | BenchmarkAsyncCacheWithSingleflight | upstream_calls/op ≈ 0.9~1.0 (1~1000 concurrency) | singleflight is effective |
| High-concurrency dedup | BenchmarkHighQpsScenario | deduplication_rate_% = 100% | dedup behaves as expected |

@MaurUppi

@olicesx
In my environment there aren't many client devices, yet short-lived (QUIC) entries reached 300 within 13 minutes, which is far too noisy at debug level. So I patched observability on top of your code (log the first occurrence, then one entry every 300); in the stock version the "short-lived UDP fast path fallback" lines and their dst/err/src pairs are not filter-friendly.

By the way, this PR's commit count (100+) is getting huge. When do you expect it to be finished?

02-26 20:19:55 level=debug msg="UDP routing tuple missing; short-lived UDP fast path fallback (Total=7200)" dst="192.168.1.15:53"
02-26 20:19:15 level=debug msg="UDP routing tuple missing; short-lived UDP fast path fallback" dst="198.18.0.2:53" error="reading map: key [192.168.1.171:57322, 17, 198.18.0.2:53]: lookup: key does not exist" src="192.168.1.171:57322"
02-26 20:06:49 level=debug msg="UDP routing tuple missing; short-lived UDP fast path fallback (Total=6900)" dst="192.168.1.15:53"

@olicesx
Contributor Author
olicesx commented Feb 26, 2026

@MaurUppi Debug logs are there for debugging, and this PR doesn't change much around logging; let's revisit that later.

@MaurUppi

@MaurUppi Debug logs are there for debugging, and this PR doesn't change much around logging; let's revisit that later.

The debug-level noise is just my personal preference. But if logging is reworked later, I'd still like the output to be structured and easy to filter and query.
Looking forward to PR 936 being merged soon.

@MaurUppi

@olicesx
For reference:

udp-taskpool-cpu-regression summary

Confirmed

  • 705/1077 goroutines are convoy, at 34.4% CPU (cumulative), consistent with the doc
  • tryDeleteQueue's LoadAndDelete + pointer comparison is indeed not an atomic CAS; it can delete the new queue and leak goroutines permanently
  • Commit attribution is correct: a795323 introduced the defect, 41d20f2 triggered the amplification
  • selectgo's 80ms flat is the largest single flat cost: scheduler churn from hundreds of convoys being scheduled every 100ms

New findings (not covered in the original doc)

  • The p.queues.Delete(key) at acquireQueue line 246 has the same kind of race: when two goroutines call acquireQueue concurrently and both see draining=true, the later one's Delete can mistakenly remove the new queue. It also needs to become
    CompareAndDelete
  • Leaked convoy goroutines can never exit (key no longer in the map → tryDeleteQueue always returns false → infinite loop); the 705 are roughly 17 hours of leaks accumulated since restart

Fix recommendations

  1. Must do: tryDeleteQueue → sync.Map.CompareAndDelete (Go 1.20+; the project uses 1.26)
  2. Must do: the Delete in acquireQueue's draining path should also become CompareAndDelete
  3. Suggested: tighten QUIC detection (add version-field validation) to cut the 6.25% false-hit rate
  4. Suggested: evaluate whether the 100ms agingTime is too aggressive

2026-02-27-udp-taskpool-cpu-regression-analysis

# dae CPU regression investigation notes (2026-02-27)

1. Background and symptoms

  • After deploying unstable-20260226.pr-26.r44.0da556 at 2026-02-26 16:52:43, CPU usage rose from the usual 1%~3% to roughly 6%~10%.
  • Perceived workload did not noticeably increase; this is a post-release regression.
  • The current pprof port is :5556 (endpoint_listen_address), not pprof_port bound directly to :6060.

2. Key evidence (this investigation)

2.1 Live metrics

  • go_goroutines: 1035~1084
  • process_cpu_seconds_total over a 30-second window estimates CPU at about 9.47%
  • 15-minute log counts (sample window):
    • dns_udp4_info=936
    • dns_tcp4_info=97
    • udp_tuple_miss=5
    • Other noisy debug/warn sources (e.g. rewrite/conn_check/udp_readloop) are near zero in this window.

2.2 CPU pprof (2026-02-27 09:42:34)

  • github.com/daeuniverse/dae/control.(*UdpTaskQueue).convoy cumulative about 31.71%
  • Pronounced runtime scheduling hotspots:
    • runtime.schedule/findRunnable/selectgo rank high.
  • convoy line-level hotspots concentrate at:
    • control/udp_task_pool.go:151 (the select loop)
    • control/udp_task_pool.go:167 (time.Sleep(10ms))
    • control/udp_task_pool.go:177 (tryDeleteQueue)

2.3 goroutine pprof

  • About 670 (*UdpTaskQueue).convoy goroutines, the bulk of the total.
  • Many stacks in the goroutine snapshot are parked at control/udp_task_pool.go:151, consistent with the CPU scheduling hotspots.

3. Root-cause analysis

3.1 Direct defect (root cause)

  • The current tryDeleteQueue implementation in control/udp_task_pool.go:
    • LoadAndDelete(key), then compare against the expected pointer.
    • Introduced by a795323 (refactor(control): optimize memory alignment and improve task queue management, 2026-02-17).
  • Problems:
    • The logic is not an atomic "delete if value matches".
    • Under concurrency it can delete the new queue mapping, then fail the comparison and return false, but the deletion has already happened.
    • The result is a corrupted queue map: both old and new convoys can enter long-lived idle spin loops, keeping goroutine and scheduling overhead persistently elevated.

3.2 Amplifying factors (what triggered this regression)

  • 41d20f2 (refactor(control): optimize UDP handling for QUIC packets and enhance task pool management, 2026-02-25) introduced two changes that significantly amplify the defect:
    • UdpTaskPool is now used whenever IsLikelyQuicInitialPacket matches;
    • agingTime changed from DefaultNatTimeout (historically 30s) to 100ms.
  • IsLikelyQuicInitialPacket inspects only bits of the first byte (component/sniffing/quic.go), a theoretical match space of about 16/256 = 6.25%, so false hits are fairly likely and drive more queue create/reclaim attempts.
  • The 100ms aging cycle sends convoys into the cleanup branch far more often, further amplifying the scheduling cost of the delete race.

4. Commit attribution

  • Commit that introduced the defect: a795323 (2026-02-17)
    • Evidence: git blame on control/udp_task_pool.go:263-270 points to a795323, which first introduced the LoadAndDelete + pointer-compare scheme.
  • Commit that triggered/amplified the regression: 41d20f2 (2026-02-25)
    • Evidence: git blame on control/udp_task_pool.go:20-25 points to 41d20f2, which introduced UdpTaskPoolAgingTime = 100ms.
    • The same commit rerouted some control_plane UDP paths back through UdpTaskPool.

5. Proposed fixes

5.1 Required fixes (do these first)

  • Give tryDeleteQueue truly atomic semantics:
    • Use sync.Map.CompareAndDelete(key, expected) (the Go version requirement is met);
    • No "delete first, compare afterwards".
  • Goal: eliminate the queue-map corruption and permanently idling convoys caused by mistakenly deleting the new mapping.

5.2 Suggested fixes (next step)

  • Tighten the IsLikelyQuicInitialPacket fast check to reduce false hits:
    • Add stronger header constraints (beyond the first byte's bits).
  • Evaluate whether agingTime=100ms is too aggressive:
    • It could fall back to a more conservative value (on the order of seconds) or an adaptive strategy.

5.3 Observability additions

  • Add udp_task_pool metrics:
    • current queue count, creations, successful deletions, failed deletions, active convoys.
  • Useful for comparing releases and for alerting.

6. Post-fix acceptance criteria

  • go_goroutines and the (*UdpTaskQueue).convoy count drop markedly and no longer hold at hundreds of idle spinners.
  • CPU returns close to the historical baseline (roughly the 1%~3% range under comparable load).
  • The pprof share of UdpTaskQueue.convoy and the scheduler hotspots drops significantly.

Opus-reviewed

# Opus Review: UDP TaskPool CPU Regression Analysis

Reviewed: 2026-02-27
Original doc: .plan/2026-02-27-udp-taskpool-cpu-regression-analysis.md
Status: Analysis largely correct; additional race conditions and nuances identified below.


1. Verification of Live Evidence

All key claims independently confirmed via live pprof (captured 2026-02-27 ~10:00 CST):

| Metric | Documented | Live Measurement |
|---|---|---|
| Total goroutines | ~1035-1084 | 1077 |
| convoy goroutines | ~670 | 705 (607 in select state) |
| convoy CPU (cum) | ~31.71% | 34.41% (320ms/930ms) |
| Line 151 select hotspot | dominant | 160ms (17.2% of total) |
| Line 167 time.Sleep(10ms) | visible | 40ms (4.3%) |
| Line 182 safeTimerReset | visible | 70ms (7.5%) |
| selectgo runtime overhead | high | 80ms flat, 180ms cum (19.4%) |

Key observations:

  • 705 convoy goroutines are in select state without explicit wait duration — meaning they are being scheduled frequently (every ~100ms timer tick), not sleeping.
  • runtime.selectgo at 80ms flat is the single largest flat-time consumer — direct evidence of scheduler thrashing from hundreds of convoy goroutines waking on timer.
  • runtime.schedule/findRunnable at 370-380ms cumulative confirms scheduler overhead from excessive runnable goroutines.

2. Root Cause Validation

2.1 tryDeleteQueue Race — CONFIRMED, with additional detail

The documented race is correct. Here is the precise sequence:

Timeline for key K:

1. Old convoy (Q1): timer fires → refs==0, ch empty
   → sets draining=true → sleeps 10ms → final check passes
   → calls tryDeleteQueue(K, Q1)

2. Meanwhile, acquireQueue(K) runs:
   → Load(K) returns Q1 → sees draining==true → goto createNew
   → LoadOrStore(K, Q2) → succeeds (stores Q2, returns loaded=false)
   → starts new convoy goroutine for Q2

3. tryDeleteQueue(K, Q1) executes:
   → LoadAndDelete(K) → removes Q2 from map, returns Q2
   → Q2 != Q1 → returns false

Result:
- Q2 is removed from the map, but Q2's convoy goroutine is running
- Q2's convoy can NEVER exit: tryDeleteQueue always returns false
  (key no longer in map → loaded=false → returns false)
- Q2 loops forever: timer fires (100ms) → draining → sleep(10ms) →
  tryDeleteQueue fails → reset timer → repeat

This is a goroutine leak with O(n) accumulation — each false-positive QUIC match for a unique AddrPort can produce an orphaned convoy.

2.2 Additional Race in acquireQueue Draining Path — NOT documented

The original analysis missed a second race at udp_task_pool.go:246:

if q.draining.Load() {
    p.queues.Delete(key)    // ← non-atomic delete of whatever is currently at key
    goto createNew
}

Scenario: Two goroutines (A, B) both call acquireQueue(K) and both see draining=true:

1. A: LoadOrStore(K) → loaded=true, returns Q1 (draining)
2. B: LoadOrStore(K) → loaded=true, returns Q1 (draining)
3. A: Delete(K)     — removes Q1
4. C: LoadOrStore(K, Q3) → stored! Q3 is now in map, C starts convoy for Q3
5. B: Delete(K)     — removes Q3 (intended to remove Q1!)
6. B: goto createNew → LoadOrStore(K, Q4) → stored, starts convoy for Q4

Result: Q3's convoy is orphaned (same mechanism as 2.1)

This is a lower-probability race (requires two concurrent acquireQueue calls for same key during draining window), but it compounds the primary issue.

Fix: Use sync.Map.CompareAndDelete(key, q) here too — only delete if the value is still the draining queue we observed.

2.3 IsLikelyQuicInitialPacket False Positive Rate — Analysis CORRECT but understated

The documented rate of 6.25% (16/256) is mathematically correct:

  • Required: bit7=1 (long header), bit6=1 (fixed bit), bits5:4=00 (Initial type)
  • Matching first bytes: 0xC0-0xCF

What the analysis understates: this affects ALL UDP traffic, not just QUIC. DNS responses, game packets, VoIP, etc. all flow through this check. With dns_udp4_info=936 in 15 minutes (~1/sec), even DNS alone produces ~4 false positives per minute. Each creates a queue+convoy that should age out in 100ms — but due to the tryDeleteQueue race, a fraction become permanent orphans.

The 705 orphaned convoys represent the cumulative leak since the last restart (~17 hours based on goroutine 19 age of 1028 minutes).

2.4 agingTime=100ms Impact — Analysis CORRECT

The 100ms aging time amplifies the race in two ways:

  1. More frequent timer firings → more opportunities to hit the tryDeleteQueue race
  2. The 10ms sleep in the draining path is 10% of the aging time, creating a larger window for concurrent acquireQueue calls

3. Commit Attribution — CONFIRMED

| Commit | Role | Evidence |
|---|---|---|
| a795323 (2026-02-17) | Introduced defect: LoadAndDelete + pointer compare pattern | git blame on tryDeleteQueue |
| 41d20f2 (2026-02-25) | Triggered regression: 100ms aging + broader QUIC path usage | git blame on UdpTaskPoolAgingTime |

4. Fix Recommendations — Review & Amendments

4.1 MUST FIX: tryDeleteQueue → CompareAndDelete ✅ Agree

func (p *UdpTaskPool) tryDeleteQueue(key netip.AddrPort, expected *UdpTaskQueue) bool {
    return p.queues.CompareAndDelete(key, expected)
}

sync.Map.CompareAndDelete is available since Go 1.20; project uses Go 1.26. Pointer comparison via any interface equality works correctly.

4.2 MUST FIX (MISSING): acquireQueue draining path

The p.queues.Delete(key) at line 246 should also use CompareAndDelete:

if q.draining.Load() {
    p.queues.CompareAndDelete(key, q)  // only delete if still the draining queue
    goto createNew
}

4.3 SHOULD FIX: Tighten IsLikelyQuicInitialPacket ✅ Agree

Add version field check (bytes 1-4). All deployed QUIC versions use specific version numbers:

  • QUIC v1: 0x00000001
  • QUIC v2: 0x6b3343cf
  • Version negotiation: 0x00000000

Even checking that buf[1:5] matches a known version set would cut false positives dramatically.
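One way to sketch the tightened check (the function mirrors IsLikelyQuicInitialPacket in spirit, but this version-set variant is a proposal, not the shipped code):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// knownQuicVersions covers the deployed version numbers listed above.
var knownQuicVersions = map[uint32]bool{
	0x00000001: true, // QUIC v1
	0x6b3343cf: true, // QUIC v2
	0x00000000: true, // version negotiation
}

// isLikelyQuicInitial requires the long-header/Initial bits of the first
// byte AND a known version in bytes 1-4, cutting the 6.25% first-byte-only
// false-hit rate dramatically.
func isLikelyQuicInitial(buf []byte) bool {
	if len(buf) < 5 {
		return false
	}
	// bit7=1 (long header), bit6=1 (fixed bit), bits5:4=00 (Initial)
	// means the first byte falls in 0xC0-0xCF.
	if buf[0]&0xF0 != 0xC0 {
		return false
	}
	return knownQuicVersions[binary.BigEndian.Uint32(buf[1:5])]
}

func main() {
	quicV1 := []byte{0xC3, 0x00, 0x00, 0x00, 0x01}
	dnsLike := []byte{0xC5, 0x12, 0x34, 0x56, 0x78} // first byte matches, version doesn't
	fmt.Println(isLikelyQuicInitial(quicV1), isLikelyQuicInitial(dnsLike)) // true false
}
```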

4.4 SHOULD EVALUATE: agingTime=100ms

100ms is aggressive. Consider:

  • 1s default with configurable override
  • Or adaptive: start at 100ms for known-QUIC (version-verified), use longer timeout for unverified matches

4.5 OPTIONAL: Convoy goroutine exit safety net

Add a maximum lifetime or iteration count to convoy's cleanup loop as defense-in-depth. If tryDeleteQueue fails N times consecutively with no tasks received, force-exit. This prevents permanent leaks from any future race conditions.
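A minimal sketch of that safety net; the counter name, threshold, and `cleanupLoop` abstraction are hypothetical, and the real convoy loop would also keep receiving tasks between attempts:

```go
package main

import "fmt"

// maxConsecutiveDeleteFailures bounds how long a convoy may spin when its
// map entry has been stolen; a defense-in-depth backstop, not the primary fix.
const maxConsecutiveDeleteFailures = 10

// cleanupLoop models convoy's aging branch: tryDelete reports whether the
// queue was removed from the map. After N consecutive failures with no new
// tasks, the goroutine force-exits instead of looping forever.
func cleanupLoop(tryDelete func() bool) (forcedExit bool) {
	failures := 0
	for {
		if tryDelete() {
			return false // clean exit: we removed our own map entry
		}
		failures++
		if failures >= maxConsecutiveDeleteFailures {
			return true // orphaned: give up rather than leak
		}
	}
}

func main() {
	// Simulate an orphaned convoy whose key is no longer in the map.
	alwaysFail := func() bool { return false }
	fmt.Println(cleanupLoop(alwaysFail)) // true
}
```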


5. Validation Criteria — Agree with additions

Original criteria are correct. Add:

  • No convoy goroutine should exist without a corresponding entry in queues sync.Map (can verify via pprof goroutine dump + metrics endpoint udp_task_pool_count)
  • After fix deployment, convoy goroutine count should track queue count closely (within single-digit delta at any point)
  • CPU should drop to baseline within minutes, not gradually (the fix eliminates the leak, existing orphans exit on restart)

6. Summary Assessment

| Aspect | Rating | Notes |
|---|---|---|
| Problem identification | Excellent | Correct root cause, correct commit attribution |
| Evidence quality | Excellent | pprof CPU + goroutine + metrics all consistent |
| Race analysis | Good | Primary race correct; missed secondary race in acquireQueue |
| Fix proposal | Good | CompareAndDelete is the right fix; missing acquireQueue fix |
| Severity assessment | Accurate | Real regression, accumulates over time, will worsen |
Bottom line: The analysis is sound and actionable. Apply both CompareAndDelete fixes (tryDeleteQueue + acquireQueue draining path), tighten QUIC detection, and the regression will be resolved.

@olicesx
Contributor Author
olicesx commented Feb 27, 2026

@olicesx
供参考

udp-taskpool-cpu-regression summary

已确认

  • 705/1077 goroutine 是 convoy,CPU 占比 34.4%(cumulative),与文档一致
  • tryDeleteQueue 的 LoadAndDelete + 指针比较确实不是原子 CAS — 会删掉新队列并导致 goroutine 永久泄漏
  • commit 归因正确:a795323 引入缺陷,41d20f2 触发放大
  • selectgo 80ms flat 是最大的单项 flat 开销 — 数百个 convoy 每 100ms 被调度一次导致的调度器抖动

新发现(原文未覆盖)

  • acquireQueue 第 246 行的 p.queues.Delete(key) 存在同类竞态 — 两个 goroutine 同时 acquireQueue 且都看到 draining=true 时,后者的 Delete 可能误删新队列。此处也需要改为
    CompareAndDelete
  • 泄漏的 convoy goroutine 永远无法退出(key 不在 map 中 → tryDeleteQueue 始终返回 false → 无限循环),705 个是重启以来 ~17 小时的累计泄漏量

修复建议

  1. 必做:tryDeleteQueue → sync.Map.CompareAndDelete(Go 1.20+,项目用 1.26)
  2. 必做:acquireQueue draining 路径的 Delete 也改为 CompareAndDelete
  3. 建议:收紧 QUIC 检测(增加 version 字段校验),降低 6.25% 的误命中率
  4. 建议:评估 100ms agingTime 是否过于激进

2026-02-27-udp-taskpool-cpu-regression-analysis

# dae CPU 回归排查记录(2026-02-27)

1. 背景与现象

  • 观察到在 2026-02-26 16:52:43 部署 unstable-202602 9E88 26.pr-26.r44.0da556 后,CPU usage 从常态 1%~3% 上升到约 6%~10%
  • 业务负载体感无明显增加,属于版本后回归现象。
  • 当前 pprof 访问端口为 :5556endpoint_listen_address),并非 pprof_port 直绑 :6060

2. 关键证据(本次排查)

2.1 在线指标

  • go_goroutines1035~1084
  • process_cpu_seconds_total 30 秒窗口估算 CPU 约 9.47%
  • 15 分钟日志统计(示例窗口):
    • dns_udp4_info=936
    • dns_tcp4_info=97
    • udp_tuple_miss=5
    • 其他高噪 debug/warn(如 rewrite/conn_check/udp_readloop)在当前窗口接近 0。

2.2 CPU pprof(2026-02-27 09:42:34)

  • github.com/daeuniverse/dae/control.(*UdpTaskQueue).convoy 累计约 31.71%
  • runtime 调度热点显著:
    • runtime.schedule/findRunnable/selectgo 占比较高。
  • convoy 行级热点集中在:
    • control/udp_task_pool.go:151select 循环)
    • control/udp_task_pool.go:167time.Sleep(10ms)
    • control/udp_task_pool.go:177tryDeleteQueue

2.3 goroutine pprof

  • (*UdpTaskQueue).convoy goroutine 约 670,为总 goroutine 主体。
  • goroutine 快照中大量栈停在 control/udp_task_pool.go:151,与 CPU 调度热点一致。

3. 根因分析

3.1 直接缺陷(根因)

  • control/udp_task_pool.gotryDeleteQueue 当前实现:
    • LoadAndDelete(key),后比较 expected 指针。
    • 代码由 a795323 引入(refactor(control): optimize memory alignment and improve task queue management,2026-02-17)。
  • 问题点:
    • 该逻辑不是原子“按期望值删除”。
    • 并发下可能删掉“新队列映射”,随后比较失败返回 false,但删除已发生。
    • 结果是队列映射状态异常,旧/新 convoy 都可能进入长期空转循环,导致 goroutine 与调度开销持续偏高。

3.2 放大因素(触发本次回归)

  • 41d20f2refactor(control): optimize UDP handling for QUIC packets and enhance task pool management,2026-02-25)引入两项变化,显著放大上述缺陷影响:
    • UdpTaskPool 触发条件改为 IsLikelyQuicInitialPacket 命中时使用;
    • agingTimeDefaultNatTimeout(历史为 30s)改为 100ms
  • IsLikelyQuicInitialPacket 仅基于首字节位判断(component/sniffing/quic.go),理论命中空间约 16/256 = 6.25%,存在较高“伪命中”概率,导致更多队列创建/回收尝试。
  • 100ms 老化周期使 convoy 更频繁进入清理分支,进一步放大删除竞态造成的调度成本。

4. commit 溯源结论

  • 引入缺陷的 commita795323(2026-02-17)
    • 证据:git blame control/udp_task_pool.go:263-270 指向 a795323,首次引入 LoadAndDelete + pointer compare 方案。
  • 触发/放大回归的 commit41d20f2(2026-02-25)
    • 证据:git blame control/udp_task_pool.go:20-25 指向 41d20f2,引入 UdpTaskPoolAgingTime = 100ms
    • 同提交将 control_plane 改为部分 UDP 路径重新走 UdpTaskPool

5. 建议修复方案

5.1 必选修复(先做)

  • tryDeleteQueue 改为真正原子语义:
    • 使用 sync.Map.CompareAndDelete(key, expected)(Go 版本满足);
    • 禁止“先删后比对”。
  • 目标:消除错误删除新映射导致的队列异常与 convoy 常驻空转。

5.2 建议修复(次步)

  • 收紧 IsLikelyQuicInitialPacket 快速判定,降低伪命中:
    • 增加更强头部约束(不仅首字节位)。
  • 评估 agingTime=100ms 是否过激进:
    • 可回调到更保守值(如秒级)或采用自适应策略。

5.3 可观测性补充

  • 增加 udp_task_pool 相关 metrics:
    • 当前队列数、创建数、删除成功数、删除失败数、convoy 活跃数。
  • 便于回归版本对比与告警。

6. 修复后验收标准

  • go_goroutines(*UdpTaskQueue).convoy 数量明显下降,不再长期维持数百级空转。
  • CPU 回落接近历史基线(业务负载相近时接近 1%~3% 区间)。
  • pprof 中 UdpTaskQueue.convoy 与调度热点占比显著下降。

Opus-reviewed

# Opus Review: UDP TaskPool CPU Regression Analysis

Reviewed: 2026-02-27
Original doc: .plan/2026-02-27-udp-taskpool-cpu-regression-analysis.md
Status: Analysis largely correct; additional race conditions and nuances identified below.


1. Verification of Live Evidence

All key claims independently confirmed via live pprof (captured 2026-02-27 ~10:00 CST):

Metric Documented Live Measurement Match
Total goroutines ~1035-1084 1077
convoy goroutines ~670 705 (607 in select state)
convoy CPU (cum) ~31.71% 34.41% (320ms/930ms)
Line 151 select hotspot dominant 160ms (17.2% of total)
Line 167 time.Sleep(10ms) visible 40ms (4.3%)
Line 182 safeTimerReset visible 70ms (7.5%)
selectgo runtime overhead high 80ms flat, 180ms cum (19.4%)

Key observations:

  • 705 convoy goroutines are in select state without explicit wait duration — meaning they are being scheduled frequently (every ~100ms timer tick), not sleeping.
  • runtime.selectgo at 80ms flat is the single largest flat-time consumer — direct evidence of scheduler thrashing from hundreds of convoy goroutines waking on timer.
  • runtime.schedule/findRunnable at 370-380ms cumulative confirms scheduler overhead from excessive runnable goroutines.

2. Root Cause Validation

2.1 tryDeleteQueue Race — CONFIRMED, with additional detail

The documented race is correct. Here is the precise sequence:

Timeline for key K:

1. Old convoy (Q1): timer fires → refs==0, ch empty
   → sets draining=true → sleeps 10ms → final check passes
   → calls tryDeleteQueue(K, Q1)

2. Meanwhile, acquireQueue(K) runs:
   → Load(K) returns Q1 → sees draining==true → goto createNew
   → LoadOrStore(K, Q2) → succeeds (stores Q2, returns loaded=false)
   → starts new convoy goroutine for Q2

3. tryDeleteQueue(K, Q1) executes:
   → LoadAndDelete(K) → removes Q2 from map, returns Q2
   → Q2 != Q1 → returns false

Result:
- Q2 is removed from the map, but Q2's convoy goroutine is running
- Q2's convoy can NEVER exit: tryDeleteQueue always returns false
  (key no longer in map → loaded=false → returns false)
- Q2 loops forever: timer fires (100ms) → draining → sleep(10ms) →
  tryDeleteQueue fails → reset timer → repeat

This is a goroutine leak with O(n) accumulation — each false-positive QUIC match for a unique AddrPort can produce an orphaned convoy.

2.2 Additional Race in acquireQueue Draining Path — NOT documented

The original analysis missed a second race at udp_task_pool.go:246:

if q.draining.Load() {
    p.queues.Delete(key)    // ← non-atomic delete of whatever is currently at key
    goto createNew
}

Scenario: Two goroutines (A, B) both call acquireQueue(K) and both see draining=true:

1. A: LoadOrStore(K) → loaded=true, returns Q1 (draining)
2. B: LoadOrStore(K) → loaded=true, returns Q1 (draining)
3. A: Delete(K)     — removes Q1
4. C: LoadOrStore(K, Q3) → stored! Q3 is now in map, C starts convoy for Q3
5. B: Delete(K)     — removes Q3 (intended to remove Q1!)
6. B: goto createNew → LoadOrStore(K, Q4) → stored, starts convoy for Q4

Result: Q3's convoy is orphaned (same mechanism as 2.1)

This is a lower-probability race (requires two concurrent acquireQueue calls for same key during draining window), but it compounds the primary issue.

Fix: Use sync.Map.CompareAndDelete(key, q) here too — only delete if the value is still the draining queue we observed.

2.3 IsLikelyQuicInitialPacket False Positive Rate — Analysis CORRECT but understated

The documented rate of 6.25% (16/256) is mathematically correct:

  • Required: bit7=1 (long header), bit6=1 (fixed bit), bits5:4=00 (Initial type)
  • Matching first bytes: 0xC0-0xCF

What the analysis understates: this affects ALL UDP traffic, not just QUIC. DNS responses, game packets, VoIP, etc. all flow through this check. With dns_udp4_info=936 in 15 minutes (~1/sec), even DNS alone produces ~4 false positives per minute. Each creates a queue+convoy that should age out in 100ms — but due to the tryDeleteQueue race, a fraction become permanent orphans.

The 705 orphaned convoys represent the cumulative leak since the last restart (~17 hours based on goroutine 19 age of 1028 minutes).

2.4 agingTime=100ms Impact — Analysis CORRECT

The 100ms aging time amplifies the race in two ways:

  1. More frequent timer firings → more opportunities to hit the tryDeleteQueue race
  2. The 10ms sleep in the draining path is 10% of the aging time, creating a larger window for concurrent acquireQueue calls

3. Commit Attribution — CONFIRMED

Commit Role Evidence
a795323 (2026-02-17) Introduced defect: LoadAndDelete + pointer compare pattern git blame on tryDeleteQueue
41d20f2 (2026-02-25) Triggered regression: 100ms aging + broader QUIC path usage git blame on UdpTaskPoolAgingTime

4. Fix Recommendations — Review & Amendments

4.1 MUST FIX: tryDeleteQueueCompareAndDelete ✅ Agree

func (p *UdpTaskPool) tryDeleteQueue(key netip.AddrPort, expected *UdpTaskQueue) bool {
    return p.queues.CompareAndDelete(key, expected)
}

sync.Map.CompareAndDelete is available since Go 1.20; project uses Go 1.26. Pointer comparison via any interface equality works correctly.

4.2 MUST FIX (MISSING): acquireQueue draining path

The p.queues.Delete(key) at line 246 should also use CompareAndDelete:

if q.draining.Load() {
    p.queues.CompareAndDelete(key, q)  // only delete if still the draining queue
    goto createNew
}

4.3 SHOULD FIX: Tighten IsLikelyQuicInitialPacket ✅ Agree

Add version field check (bytes 1-4). All deployed QUIC versions use specific version numbers:

  • QUIC v1: 0x00000001
  • QUIC v2: 0x6b3343cf
  • Version negotiation: 0x00000000

Even checking that buf[1:5] matches a known version set would cut false positives dramatically.

4.4 SHOULD EVALUATE: agingTime=100ms

100ms is aggressive. Consider:

  • 1s default with configurable override
  • Or adaptive: start at 100ms for known-QUIC (version-verified), use longer timeout for unverified matches

4.5 OPTIONAL: Convoy goroutine exit safety net

Add a maximum lifetime or iteration count to convoy's cleanup loop as defense-in-depth. If tryDeleteQueue fails N times consecutively with no tasks received, force-exit. This prevents permanent leaks from any future race conditions.


5. Validation Criteria — Agree with additions

Original criteria are correct. Add:

  • No convoy goroutine should exist without a corresponding entry in queues sync.Map (can verify via pprof goroutine dump + metrics endpoint udp_task_pool_count)
  • After fix deployment, convoy goroutine count should track queue count closely (within single-digit delta at any point)
  • CPU should drop to baseline within minutes, not gradually (the fix eliminates the leak, existing orphans exit on restart)

6. Summary Assessment

| Aspect | Rating | Notes |
| --- | --- | --- |
| Problem identification | Excellent | Correct root cause, correct commit attribution |
| Evidence quality | Excellent | pprof CPU + goroutine + metrics all consistent |
| Race analysis | Good | Primary race correct; missed secondary race in acquireQueue |
| Fix proposal | Good | CompareAndDelete is the right fix; missing acquireQueue fix |
| Severity assessment | Accurate | Real regression, accumulates over time, will worsen |

Bottom line: The analysis is sound and actionable. Apply both CompareAndDelete fixes (tryDeleteQueue + acquireQueue draining path), tighten QUIC detection, and the regression will be resolved.

Thanks for the review.

@MaurUppi
MaurUppi commented Feb 27, 2026

@olicesx
For reference.

"Thanks for the review"? It's your ongoing contributions that deserve the thanks.

Test/build details: MaurUppi#26

EDITED

PR#26 before/after report <-- appended: same-methodology 24h retest (two windows + pprof)

udp-taskpool CPU regression: before/after monitoring of the fix (2026-02-27)

1. Monitoring Goal and Subject

  • Subject: after deploying the revised PR#26 (dae running since 2026-02-27 11:37:37 CST)
  • Goal: compare CPU / goroutine / UdpTaskQueue.convoy behavior before and after the fix, to verify the regression is eliminated
  • Reference baseline: .plan/2026-02-27-udp-taskpool-cpu-regression-analysis.md

2. Pre-fix Baseline (Before)

Timestamp: 2026-02-27 09:42:34 CST (while triaging the old version)

  • go_goroutines: about 1035-1084
  • 30s-window CPU estimate: about 9.47%
  • goroutine pprof: (*UdpTaskQueue).convoy about 670
  • CPU pprof: (*UdpTaskQueue).convoy cumulative about 31.71%

3. Post-fix Monitoring (After)

Note: executed as the planned two monitoring windows.
During the run the actual metrics path turned out to be /metrics (/debug/metrics returned 404); the second window was re-sampled against /metrics and corrected.

3.1 Window 1 (early window after startup)

Monitoring period: 2026-02-27 12:02:52 CST to 2026-02-27 12:14:56 CST

| index | timestamp | go_goroutines | go_threads | process_cpu_seconds_total | rss_bytes | convoy_goroutines |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 2026-02-27 12:02:52 CST | 323 | 13 | 33.07 | 8.7605248e+07 | 2 |
| 1 | 2026-02-27 12:04:53 CST | 357 | 13 | 35.75 | 9.0226688e+07 | 2 |
| 2 | 2026-02-27 12:06:53 CST | 392 | 13 | 38.85 | 8.9423872e+07 | 2 |
| 3 | 2026-02-27 12:08:54 CST | 710 | 13 | 43.11 | 9.4928896e+07 | 2 |
| 4 | 2026-02-27 12:10:55 CST | 573 | 13 | 48.65 | 9.5891456e+07 | 7 |
| 5 | 2026-02-27 12:12:55 CST | 540 | 13 | 58.31 | 9.9557376e+07 | 9 |
| 6 | 2026-02-27 12:14:56 CST | 567 | 13 | 66.99 | 1.04476672e+08 | 14 |

Summary:

  • go_goroutines mean 494.57, peak 710
  • convoy_goroutines mean 5.43, peak 14
  • CPU estimate (Δprocess_cpu_seconds_total / Δwall * 100): 4.69%

pprof (end of window 1, 2026-02-27 12:15:10 CST):

  • CPU profile: 780ms total samples (of 30s, about 2.60%)
    • (*UdpTaskQueue).convoy cumulative 50ms (6.41%)
  • Goroutine profile: 546 total, 14 of them convoy
  • Heap inuse: about 30.4MB

3.2 Window 2 (1h+ after startup)

Monitoring period: 2026-02-27 12:50:17 CST to 2026-02-27 13:02:19 CST

| index | timestamp | go_goroutines | go_threads | process_cpu_seconds_total | rss_bytes | convoy_goroutines |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 2026-02-27 12:50:17 CST | 461 | 13 | 125.56 | 1.01920768e+08 | 18 |
| 1 | 2026-02-27 12:52:17 CST | 389 | 13 | 128.57 | 9.6444416e+07 | 19 |
| 2 | 2026-02-27 12:54:18 CST | 412 | 13 | 132.37 | 9.7374208e+07 | 19 |
| 3 | 2026-02-27 12:56:18 CST | 382 | 13 | 135.21 | 9.80992e+07 | 19 |
| 4 | 2026-02-27 12:58:18 CST | 381 | 13 | 137.94 | 9.2954624e+07 | 21 |
| 5 | 2026-02-27 13:00:19 CST | 372 | 13 | 141.89 | 9.0628096e+07 | 22 |
| 6 | 2026-02-27 13:02:19 CST | 397 | 13 | 145.14 | 9.1152384e+07 | 22 |

Summary:

  • go_goroutines mean 399.14, peak 461
  • convoy_goroutines mean 20.00, peak 22
  • CPU estimate (Δprocess_cpu_seconds_total / Δwall * 100): 2.71%

pprof (end of window 2, 2026-02-27 13:02:28 to 13:02:58 CST):

  • CPU profile: 800ms total samples (of 30s, about 2.67%)
    • (*UdpTaskQueue).convoy cumulative 50ms (6.25%)
  • Goroutine profile: 416 total, 22 of them convoy
  • Heap inuse: about 29.4MB

4. Before/After Conclusions

4.1 The regression fix is effective (core metrics)

  • CPU: from about 9.47% (Before) down to 4.69% (After, window 1), then further to 2.71% (After, window 2)
  • Total goroutines: from about 1035-1084 (Before) down to a mean range of roughly 399-495
  • convoy goroutines: from about 670 (Before) down to 14 (end of window 1) / 22 (end of window 2)
  • convoy cumulative CPU share: from 31.71% (Before) down to about 6.3%

4.2 Current-state Assessment

  • The old failure shape ("hundreds of resident convoys + scheduler churn") was not observed again.
  • In window 2, convoy grew from 18 to 22; this is low-level fluctuation, far below the pre-fix magnitude, and the current evidence is insufficient to declare a new leak.

5. Follow-up Suggestions

  1. After 24h of continuous running, repeat a same-methodology retest (same two windows + pprof) to confirm convoy stays within low double digits.
  2. Add dedicated udp_task_pool metrics (queue count, create/delete successes, delete failures, active convoys) to shrink the observation blind spot of relying on pprof grep.
  3. If CPU fluctuation persists, run an A/B under the same load: temporarily lower the metrics scrape frequency to rule out user-space overhead noise from the collection itself.

6. 2026-02-28 Same-methodology Retest (two windows + pprof)

Retest goal: validate the key judgment from section 5, i.e. whether convoy can stay stable in the low double digits.

Methodology (same as section 3):

  • 7 sample points per window (about 12 minutes, 2-minute interval)
  • at the end of each window, capture a 30-second CPU pprof, a goroutine pprof, and a heap pprof
  • endpoints: /metrics and /debug/pprof/* (http://127.0.0.1:5556, collected via ssh dae)
  • gap between the two windows: 30 minutes

Raw sample files: .plan/udp-taskpool-cpu-regression/retest-20260228-105118/

6.1 Window 1

Monitoring period: 2026-02-28 10:51:18 CST to 2026-02-28 11:03:22 CST

| index | timestamp | go_goroutines | go_threads | process_cpu_seconds_total | rss_bytes | convoy_goroutines |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 2026-02-28 10:51:18 CST | 967 | 14 | 3416.64 | 1.355776e+08 | 360 |
| 1 | 2026-02-28 10:53:18 CST | 977 | 14 | 3425.39 | 1.37412608e+08 | 546 |
| 2 | 2026-02-28 10:55:19 CST | 985 | 14 | 3434.36 | 1.38723328e+08 | 592 |
| 3 | 2026-02-28 10:57:20 CST | 969 | 14 | 3443.3 | 1.38190848e+08 | 593 |
| 4 | 2026-02-28 10:59:20 CST | 930 | 14 | 3451.83 | 1.4422016e+08 | 549 |
| 5 | 2026-02-28 11:01:21 CST | 952 | 14 | 3460.12 | 1.4422016e+08 | 595 |
| 6 | 2026-02-28 11:03:22 CST | 989 | 14 | 3469.04 | 1.44351232e+08 | 319 |

Summary:

  • go_goroutines mean 967.00, peak 989
  • convoy_goroutines mean 507.71, peak 595
  • CPU estimate (Δprocess_cpu_seconds_total / Δwall * 100): 7.24%

pprof (end of window 1, 2026-02-28 11:03:22 to 11:03:53 CST):

  • CPU profile: 2.08s total samples (of 30s, about 6.93%)
    • (*UdpTaskQueue).convoy cumulative 0.85s (40.87%)
  • Goroutine profile: 990 total, 588 of them convoy
  • Heap inuse: about 50.70MB

6.2 Window 2

Monitoring period: 2026-02-28 11:33:55 CST to 2026-02-28 11:46:01 CST

| index | timestamp | go_goroutines | go_threads | process_cpu_seconds_total | rss_bytes | convoy_goroutines |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 2026-02-28 11:33:55 CST | 1089 | 14 | 3611.72 | 1.44420864e+08 | 485 |
| 1 | 2026-02-28 11:35:56 CST | 1088 | 14 | 3620.69 | 1.46649088e+08 | 606 |
| 2 | 2026-02-28 11:37:57 CST | 1022 | 14 | 3629.28 | 1.46649088e+08 | 606 |
| 3 | 2026-02-28 11:39:58 CST | 956 | 14 | 3638.99 | 1.46649088e+08 | 356 |
| 4 | 2026-02-28 11:41:59 CST | 1042 | 14 | 3648.21 | 1.46509824e+08 | 590 |
| 5 | 2026-02-28 11:44:00 CST | 1051 | 14 | 3657.28 | 1.46509824e+08 | 583 |
| 6 | 2026-02-28 11:46:01 CST | 1072 | 14 | 3666.1 | 1.47165184e+08 | 610 |

Summary:

  • go_goroutines mean 1045.71, peak 1089
  • convoy_goroutines mean 548.00, peak 610
  • CPU estimate (Δprocess_cpu_seconds_total / Δwall * 100): 7.49%

pprof (end of window 2, 2026-02-28 11:46:01 to 11:46:33 CST):

  • CPU profile: 1.88s total samples (of 30s, about 6.27%)
    • (*UdpTaskQueue).convoy cumulative 0.71s (37.77%)
  • Goroutine profile: 1068 total, 311 of them convoy
  • Heap inuse: about 49.27MB (49265.42kB)

6.3 Retest Verdict on "Low-double-digit Stability"

  • Verdict: not passed. Across both windows, convoy_goroutines stayed at the hundreds level for long stretches (means 507.71 / 548.00) and did not stabilize in the low double digits.
  • Compared with section 3.2 (window 2) on 2026-02-27:
    • convoy_goroutines mean: 20.00 -> 548.00
    • go_goroutines mean: 399.14 -> 1045.71
    • CPU estimate: 2.71% -> 7.49%
    • convoy cumulative CPU: 6.25% -> 37.77%

7. New Findings and Next Steps (Systematic Debugging)

Based on this retest's evidence, the current state is closer to the failure shape of "convoy amplification/persistence again" than to "stable low-level fluctuation".

Recommended next step: continue Phase 1 of the same process (evidence collection only, no code changes yet):

  1. At a fixed point in time, record the running instance's version and start time (confirm it matches the expected commit).
  2. At the same sample points, also scrape udp_task_pool_count and the newly added compare-and-delete metrics (if exposed), and check whether "queue count vs convoy count" diverge.
  3. Compare this round's traffic pattern with 2026-02-27 (DNS QPS, connection-establishment rate, rule-hit distribution) to rule out load-side amplification.

8. 2026-02-28 Additional Evidence: Verifying the Remaining "Resident convoy" Path (no code changes)

8.1 Goal

  • Verify whether, with the CAS fix already merged, the current implementation still contains an execution path that can leave convoys resident.
  • This section is Phase 1 evidence collection only; it includes no code fix.

8.2 Key Live Evidence (production)

Sample time: 2026-02-28 12:26 to 12:32 CST (/metrics + goroutine pprof pulled via ssh dae)

Single-point sample:

  • dae_udp_task_queues_active = 0
  • go_goroutines ≈ 1052
  • goroutine pprof:
    • convoy_line151 (control/udp_task_pool.go:151): about 599
    • convoy_line167 (control/udp_task_pool.go:167): about 25

Short time series (8 points, excerpt):

| timestamp | queue_active | go_goroutines | convoy_line151 | convoy_line167 |
| --- | --- | --- | --- | --- |
| 2026-02-28 12:26:39 CST | 0 | 1052 | 624 | 0 |
| 2026-02-28 12:27:09 CST | 0 | 1061 | 533 | 91 |
| 2026-02-28 12:27:39 CST | 0 | 1055 | 624 | 0 |
| 2026-02-28 12:28:10 CST | 0 | 1030 | 523 | 102 |
| 2026-02-28 12:28:40 CST | 0 | 1066 | 625 | 0 |
| 2026-02-28 12:31:28 CST | 0 | 1072 | 549 | 77 |
| 2026-02-28 12:31:43 CST | 0 | 1070 | 626 | 0 |
| 2026-02-28 12:31:58 CST | 0 | 1065 | 626 | 0 |

Observations:

  • queue_active stays at 0 while convoy (line 151) stays in the hundreds; this is not transient jitter.
  • So there exists a population of convoy goroutines that are still alive but no longer present in DefaultUdpTaskPool.queues.

8.3 Mapping to Code Paths

Relevant code points:

  • task-enqueue trigger (QUIC initial packet): control/control_plane.go:1194-1200
  • convoy main loop and cleanup path: control/udp_task_pool.go:151-183
  • acquireQueue draining branch: control/udp_task_pool.go:245-247
  • tryDeleteQueue (CAS): control/udp_task_pool.go:265-266

Remaining path that can cause residency (based on the code plus the metric divergence above):

  1. A convoy enters the timer cleanup branch, sets draining=true, then Sleep(10ms).
  2. A concurrent acquireQueue hits the same key, observes draining=true, and in its branch calls p.tryDeleteQueue(key, q), possibly deleting that queue from the map first.
  3. When the convoy wakes up, its own q.p.tryDeleteQueue(q.key, q) fails (the key is gone from the map, or the value has changed).
  4. The failure path currently only does draining=false + reset timer + continue; there is no exit condition for "the map entry is no longer me", so the convoy can keep polling indefinitely.

8.4 Section Conclusion (Phase 1)

  • We now have three categories of evidence (metric divergence + stack locations + code path) supporting a remaining resident-convoy path.
  • The evidence is sufficient to proceed to minimal validation (Phase 2/3): a single-variable experiment around "the exit condition after tryDeleteQueue fails".

9. Phase 2/3: Single-variable Minimal Validation (exit condition after tryDeleteQueue failure)

9.1 Hypothesis (single)

  • Hypothesis: when a convoy's cleanup-phase tryDeleteQueue fails and the map no longer holds the key -> q mapping, the current implementation keeps looping instead of exiting, producing a "resident convoy detached from the map".
  • Basis: section 8 already observed queue_active=0 coexisting with hundreds of convoy(line151) goroutines.

9.2 Variable Control

  • Held constant:
    • production code logic unchanged (tests only)
    • queue aging mechanism and convoy main-loop path unchanged
  • Single manipulated variable:
    • after the convoy enters draining, the "concurrent path" deletes the key -> q mapping from the map first, and only then does the convoy attempt its own delete.

9.3 Experiment Implementation

  • New diagnostic test: control/udp_task_pool_phase3_validation_test.go
  • Test name: TestPhase3_ConvoyPersistsWhenQueueMappingDeletedBeforeSelfDelete
  • Core steps:
    1. Start a convoy and wait for it to enter draining=true.
    2. Call pool.tryDeleteQueue(key, q) once up front, driving the map count to 0.
    3. Wait for the convoy to wake up and attempt self-deletion.
    4. Assert: the convoy has not exited (still alive) while the map count is already 0.
    5. For test teardown, restore the key -> q mapping so the convoy can exit normally on its next round.

9.4 Execution and Results

Execution time: 2026-02-28 12:56 to 12:57 CST

Command (Linux container):

```shell
docker run --rm -e GOTOOLCHAIN=auto -v "$PWD":/src -w /src golang:1.25 \
  go test ./control -run TestPhase3_ConvoyPersistsWhenQueueMappingDeletedBeforeSelfDelete -v
```

Result: PASS

Same test with the race detector:

```shell
docker run --rm -e GOTOOLCHAIN=auto -v "$PWD":/src -w /src golang:1.25 \
  go test -race ./control -run TestPhase3_ConvoyPersistsWhenQueueMappingDeletedBeforeSelfDelete -v
```

Result: PASS

9.5 Conclusion (Phase 2/3)

  • The single-variable experiment supports the hypothesis in 9.1: in the "mapping deleted first" scenario, the convoy can stay alive while the map count is 0.
  • This is directionally consistent with the production symptom of "queue_active=0 yet convoys still resident in large numbers".
  • Stage conclusion: evidence is confirmed; the next step is Phase 4 (after pinning down the intended exit semantics, make a minimal code fix plus regression tests).

10. Phase 4: Minimal Fix and Regression Validation

10.1 Intended Semantics (codified as a test first)

  • Expectation: when a convoy's cleanup-time tryDeleteQueue fails and the map no longer holds the key -> current-q mapping (absent, or replaced by a new queue), the old convoy must exit rather than keep polling as a resident.

Test implementation:

  • New regression test: control/udp_task_pool_convoy_exit_test.go
  • Case: TestConvoyExitsWhenQueueMappingDeletedBeforeSelfDelete
  • TDD red confirmed: before the fix, the case fails ("convoy did not exit after mapping was deleted before self-delete").

10.2 Minimal Code Fix

File: control/udp_task_pool.go (the convoy timer cleanup branch)

Changes:

  1. Keep the original successful-delete branch unchanged (on tryDeleteQueue==true, return the channel and exit).
  2. After tryDeleteQueue==false, add a mapping-ownership check:
    • if the map no longer holds the key, or the value under the key is no longer the current q, classify this convoy as stale;
    • return the channel and exit immediately;
    • no longer fall into the resident loop of draining=false + timer reset + continue.

10.3 Validation Results

Execution time: 2026-02-28 13:04 to 13:08 CST

  1. New regression test (post-fix):
    • go test ./control -run TestConvoyExitsWhenQueueMappingDeletedBeforeSelfDelete -v
    • Result: PASS
  2. Related regression set (CI methodology, extended):
    • go test ./control -run 'Test(UdpTaskPool|CompareAndDelete|NoGoroutineLeak|Convoy|HighConcurrencyStress)' -count=1
    • Result: PASS
  3. Same set with the race detector:
    • go test -race ./control -run 'Test(UdpTaskPool|CompareAndDelete|NoGoroutineLeak|Convoy|HighConcurrencyStress)' -count=1
    • Result: PASS
  4. QUIC detection regression:
    • go test ./component/sniffing -run TestIsLikelyQuicInitialPacket -count=1
    • Result: PASS

10.4 Phase 4 Conclusion

  • The "failing test -> minimal fix -> regression pass" loop is complete.
  • The fix targets exactly "the exit condition after tryDeleteQueue fails", is consistent with the Phase 1-3 evidence chain, and introduces no additional behavioral surface.

@MaurUppi
MaurUppi commented Feb 28, 2026

@olicesx

Locating the commit PR#936 needs a follow-up fix for

  • The commit needing a follow-up fix is 6c71a20 (it applied the CAS fix but left a residual path).
  • Short causal chain:
    • 6c71a20 changed deletion to CompareAndDelete, fixing the "new queue deleted by mistake" problem;
    • but inside the convoy, a failed tryDeleteQueue still goes draining=false + reset + continue;
    • when the mapping has been concurrently deleted/replaced, the old convoy detaches from the map and loops as a resident, causing convoy accumulation and CPU growth.
  • The fix point is udp_task_pool.go:182: add "exit immediately when the mapping no longer points at the current q".

The PR on my own fork: #27

metrics dashboard log

(Screenshots: metrics dashboard, CleanShot 2026-02-28 13:48 to 13:50)

@olicesx olicesx force-pushed the optimize/code-quality-fixes branch from e9685c7 to 7eafc6e on March 2, 2026 at 01:52
qi-mooo pushed a commit to qi-mooo/dae that referenced this pull request Apr 8, 2026
qi-mooo added a commit to qi-mooo/dae that referenced this pull request Apr 8, 2026
kix and others added 23 commits April 12, 2026 18:20
…bility

- Simplified the dnsForwarderKey structure by removing unnecessary dialArgument.
- Added tests for ResetDnsForwarders to ensure in-flight forwarders are handled correctly.
- Enhanced DNSListener to use atomic pointers for the ControlPlane, improving thread safety.
- Updated dnsHandler to utilize the new Controller method for better error handling.
- Introduced new methods in failedQuicDcidCache for managing shard storage and cleanup.
- Improved routing matcher builder to retain state in snapshots and refactored kernspace building logic.
- Added tests to verify the integrity of the routing kernspace snapshot.
- Enhanced UDP handling with new packet sending functions to support advanced features.
The reload preparation path in cmd/run.go uses a 45-second timeout context that was leaking into ControlPlane lifecycle contexts via context.WithCancel(ctx). When the timeout fired, Serve() would exit and all traffic (both direct and proxy) would die.

- Derive all CP-owned contexts from context.Background() instead of the caller's potentially-timed-out ctx
- Add retired atomic.Bool to block stale health-check callback writes during drain
- Add MarkRetired() to both staged and non-staged retirement goroutines
- Add Serve() exit reason logging to distinguish normal vs timeout-driven exits

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
DNS controller workers (bpfUpdateWorker, janitor, evictor) watched baseContext().Done() which changes across reload generations. When the lifecycle context swapped during reload, workers would exit prematurely.

Remove baseContext().Done() watches so workers survive across reloads. Workers are stopped via explicit stop channels closed in DnsController.Close().

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
clearRejectedReloadProgress() hardcoded SignalProgressFilePath for reads, but tests override the writer to use temp files. Add getRunSignalProgress variable so tests can override both read and write paths.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Three DNS router tests use real UDP sockets with SO_MARK which requires CAP_NET_ADMIN. Add skipIfNoSocketMark helper and skip these tests in CI containers that lack the capability.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
RestoreHealthSnapshot unconditionally set reloadInheritedHealth which added a full CheckInterval (~30s) delay before the first health check. When dialers inherited NOT-ALIVE state from the previous generation, they stayed unreachable for 30+ seconds after reload.

Only defer the first health check when ALL inherited collections are ALIVE. NOT-ALIVE dialers need an immediate probe to recover connectivity.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
…and limit response reads

- errors: IsUDPEndpointNormalClose(nil) returns false to match companion function
- netutils: comma-ok type assertion for logger from context value
- subscription: cap io.ReadAll with 10MB LimitReader
- config_merger: defer f.Close() after os.Open to prevent fd leak
- rawsock_linux: syscall.Close(sock) on bind failure to prevent fd leak

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
…t DoH response reads

- daedns: add singleflight.Group to LookupIPAddr for concurrent lookup dedup
- daedns: use sync.Pool for 65535-byte UDP DNS buffers instead of per-query allocation
- daedns: cap DoH response with io.LimitReader(resp.Body, 65535)
- control/dns: cap DoH response with io.LimitReader(resp.Body, 65535)

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
…e exit, and connectivity check limit

- routing: detect IPv6 with ':' and use /128 instead of /32
- outbound/filter: cache compiled regexp2 patterns in sync.Map
- sniffing: select on ctx.Done() in readStreamOnceAsync to prevent goroutine leak
- connectivity_check: cap debug body read with io.LimitReader(resp.Body, 4096)

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
…te audit-fix head

- cmd/run: convert if-else chain to switch for golangci-lint gocritic
- go.mod: replace local outbound with remote olicesx/outbound pseudo-version

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
- Introduced a shared timeout for DNS lookups to prevent blocking indefinitely.
- Added functions to interrupt connections on context cancellation for both TCP and QUIC.
- Enhanced the Router to utilize the new timeout and interruption mechanisms.
- Updated tests to verify the behavior of deduplicated lookups and large UDP responses.
- Modified the DNS forwarder to track consecutive errors and retire after a threshold.
- Adjusted the handling of proxy TCP forwarders to retain them on ordinary transport errors.
- Updated go.mod to use the latest outbound dependency version.

6 participants